This module extends code contained in Coronavirus_Statistics_v004.Rmd to include sourcing of all key functions and parameters. This file includes the latest code for analyzing all-cause death data from CDC Weekly Deaths by Jurisdiction. CDC maintains data on deaths by week, age cohort, and state in the US. Downloaded data are unique by state, epidemiological week, year, age, and type (actual vs. predicted/projected).
These data are known to have a lag between death and reporting, and the CDC back-correct to report deaths at the time the death occurred even if the death is reported in following weeks. This means totals for recent weeks tend to run low (lag), and the CDC run a projection of the expected total number of deaths given the historical lag times. Per other analysts on the internet, there is currently significant supra-lag, with lag times much longer than historical averages causing CDC projected deaths for recent weeks to be low.
The code leverages tidyverse and sourced functions throughout:
# All functions assume that tidyverse and its components are loaded and available
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.3 v purrr 0.3.4
## v tibble 3.1.1 v dplyr 1.0.6
## v tidyr 1.1.3 v stringr 1.4.0
## v readr 1.4.0 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
# If the same function is in both files, use the version from the more specific source
source("./Generic_Added_Utility_Functions_202105_v001.R")
source("./Coronavirus_CDC_Excess_Functions_v001.R")
The main function is readRunCDCAllCause(), which performs multiple tasks:
STEP 0: Optionally, downloads the latest data file from CDC STEP 1: Reads and processes a data file has been downloaded from CDC to local
STEP 2: Extract relevant data from a processed state-level COVID Tracking Project list
STEP 3: Basic plots of the CDC data
STEP 4: Basic excess-deaths analysis
STEP 5: Create cluster-level aggregate plots
STEP 6: Create state-level aggregate plots
STEP 7: Create age-cohort aggregate plots
STEP 8: Returns a list of key data frames, modeling objects, named cluster vectors, etc.
The functions are tested on previously downloaded data:
cdcLoc <- "Weekly_counts_of_deaths_by_jurisdiction_and_age_group_downloaded_20210623.csv"
cdcList_20210703 <- readRunCDCAllCause(loc=cdcLoc,
weekThru=17,
lst=readFromRDS("cdc_daily_210528"),
dlData=FALSE,
stateNoCheck=c("NC"),
pdfCluster=TRUE,
pdfAge=TRUE
)
##
## Parameter cvDeathThru has been set as: 2021-05-01
##
##
## *** Data suppression checks ***
## # A tibble: 2 x 6
## noCheck state problem curWeek n deaths
## <lgl> <chr> <lgl> <lgl> <int> <dbl>
## 1 TRUE NC TRUE FALSE 72 NA
## 2 TRUE NC TRUE TRUE 6 NA
## # A tibble: 2 x 3
## noCheck curWeek n
## <lgl> <lgl> <int>
## 1 TRUE FALSE 72
## 2 TRUE TRUE 6
##
##
## Data suppression checks passed
##
##
## *** File has been checked for uniqueness by: state year week age
##
## Rows: 91,537
## Columns: 12
## $ fullState <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Alabama", "Ala~
## $ weekEnding <date> 2015-01-10, 2015-01-10, 2015-01-10, 2015-01-10, 2015-01-10~
## $ state <chr> "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL",~
## $ year <fct> 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015,~
## $ week <int> 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4,~
## $ age <fct> Under 25 years, 25-44 years, 45-64 years, 65-74 years, 75-8~
## $ period <fct> 2015-2019, 2015-2019, 2015-2019, 2015-2019, 2015-2019, 2015~
## $ Type <chr> "Predicted (weighted)", "Predicted (weighted)", "Predicted ~
## $ Suppress <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,~
## $ n <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,~
## $ deaths <dbl> 25, 67, 253, 202, 272, 320, 28, 49, 256, 222, 253, 332, 26,~
## $ Note <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,~
##
## Check Control Levels and Record Counts for Processed Data:
##
##
## Checking variable combination: age
## # A tibble: 6 x 4
## age n n_deaths_na deaths
## <fct> <dbl> <dbl> <dbl>
## 1 Under 25 years 10735 0 369164
## 2 25-44 years 13656 0 902390
## 3 45-64 years 16793 0 3549786
## 4 65-74 years 16783 0 3558139
## 5 75-84 years 16790 0 4401133
## 6 85 years and older 16780 0 5681860
##
##
## Checking variable combination: period year Type
## # A tibble: 7 x 6
## period year Type n n_deaths_na deaths
## <fct> <fct> <chr> <dbl> <dbl> <dbl>
## 1 2015-2019 2015 Predicted (weighted) 14364 0 2691180
## 2 2015-2019 2016 Predicted (weighted) 14445 0 2723236
## 3 2015-2019 2017 Predicted (weighted) 14404 0 2801986
## 4 2015-2019 2018 Predicted (weighted) 14400 0 2830372
## 5 2015-2019 2019 Predicted (weighted) 14415 0 2844025
## 6 2020 2020 Predicted (weighted) 14837 0 3433405
## 7 2021 2021 Predicted (weighted) 4672 0 1138268
##
##
## Checking variable combination: period Suppress
## # A tibble: 3 x 5
## period Suppress n n_deaths_na deaths
## <fct> <chr> <dbl> <dbl> <dbl>
## 1 2015-2019 <NA> 72028 0 13890799
## 2 2020 <NA> 14837 0 3433405
## 3 2021 <NA> 4672 0 1138268
##
##
## Checking variable combination: period Note
## # A tibble: 9 x 5
## period Note n n_deaths_na deaths
## <fct> <chr> <dbl> <dbl> <dbl>
## 1 2015-20~ <NA> 72028 0 1.39e7
## 2 2020 Data in recent weeks are incomplete. Only ~ 13194 0 2.96e6
## 3 2020 Data in recent weeks are incomplete. Only ~ 531 0 2.31e5
## 4 2020 Weighted numbers of deaths are 20% or more~ 280 0 6.00e4
## 5 2020 Weights may be too low to account for unde~ 18 0 9.85e3
## 6 2020 <NA> 814 0 1.69e5
## 7 2021 Data in recent weeks are incomplete. Only ~ 4469 0 1.10e6
## 8 2021 Data in recent weeks are incomplete. Only ~ 14 0 9.65e2
## 9 2021 Data in recent weeks are incomplete. Only ~ 189 0 3.58e4
##
## *** File has been checked for uniqueness by: cluster year week
##
## Plots will be run after excluding stateNoCheck states
##
## Detailed cluster summary PDF file is available at: ./RInputFiles/Coronavirus/Plots/CDC_cluster_2021w17.pdf
##
## Returning plot outputs to the main log file
## Joining, by = "state"
##
## Detailed age summary PDF file is available at: ./RInputFiles/Coronavirus/Plots/CDC_age_2021w17.pdf
##
## Returning plot outputs to the main log file
The latest data are downloaded and processed:
cdcLoc <- "Weekly_counts_of_deaths_by_jurisdiction_and_age_group_downloaded_20210708.csv"
cdcList_20210708 <- readRunCDCAllCause(loc=cdcLoc,
weekThru=22,
lst=readFromRDS("cdc_daily_210708"),
stateNoCheck=c("NC", "AK", "WV"),
pdfCluster=TRUE,
pdfAge=TRUE
)
##
## Parameter cvDeathThru has been set as: 2021-06-05
##
##
## *** Data suppression checks ***
## # A tibble: 4 x 6
## noCheck state problem curWeek n deaths
## <lgl> <chr> <lgl> <lgl> <int> <dbl>
## 1 TRUE AK TRUE FALSE 2 NA
## 2 TRUE NC TRUE FALSE 102 NA
## 3 TRUE NC TRUE TRUE 6 NA
## 4 TRUE WV TRUE TRUE 2 NA
## # A tibble: 2 x 3
## noCheck curWeek n
## <lgl> <lgl> <int>
## 1 TRUE FALSE 104
## 2 TRUE TRUE 8
##
##
## Data suppression checks passed
##
##
## *** File has been checked for uniqueness by: state year week age
##
## Rows: 92,880
## Columns: 12
## $ fullState <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Alabama", "Ala~
## $ weekEnding <date> 2015-01-10, 2015-01-10, 2015-01-10, 2015-01-10, 2015-01-10~
## $ state <chr> "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL",~
## $ year <fct> 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015,~
## $ week <int> 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4,~
## $ age <fct> Under 25 years, 25-44 years, 45-64 years, 65-74 years, 75-8~
## $ period <fct> 2015-2019, 2015-2019, 2015-2019, 2015-2019, 2015-2019, 2015~
## $ Type <chr> "Predicted (weighted)", "Predicted (weighted)", "Predicted ~
## $ Suppress <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,~
## $ n <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,~
## $ deaths <dbl> 25, 67, 253, 202, 272, 320, 28, 49, 256, 222, 253, 332, 26,~
## $ Note <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,~
##
## Check Control Levels and Record Counts for Processed Data:
##
##
## Checking variable combination: age
## # A tibble: 6 x 4
## age n n_deaths_na deaths
## <fct> <dbl> <dbl> <dbl>
## 1 Under 25 years 10890 0 374959
## 2 25-44 years 13868 0 919211
## 3 45-64 years 17038 0 3605423
## 4 65-74 years 17027 0 3615820
## 5 75-84 years 17033 0 4467166
## 6 85 years and older 17024 0 5757892
##
##
## Checking variable combination: period year Type
## # A tibble: 7 x 6
## period year Type n n_deaths_na deaths
## <fct> <fct> <chr> <dbl> <dbl> <dbl>
## 1 2015-2019 2015 Predicted (weighted) 14364 0 2691176
## 2 2015-2019 2016 Predicted (weighted) 14443 0 2723213
## 3 2015-2019 2017 Predicted (weighted) 14408 0 2802027
## 4 2015-2019 2018 Predicted (weighted) 14400 0 2830376
## 5 2015-2019 2019 Predicted (weighted) 14414 0 2844003
## 6 2020 2020 Predicted (weighted) 14838 0 3432903
## 7 2021 2021 Predicted (weighted) 6013 0 1416773
##
##
## Checking variable combination: period Suppress
## # A tibble: 3 x 5
## period Suppress n n_deaths_na deaths
## <fct> <chr> <dbl> <dbl> <dbl>
## 1 2015-2019 <NA> 72029 0 13890795
## 2 2020 <NA> 14838 0 3432903
## 3 2021 <NA> 6013 0 1416773
##
##
## Checking variable combination: period Note
## # A tibble: 10 x 5
## period Note n n_deaths_na deaths
## <fct> <chr> <dbl> <dbl> <dbl>
## 1 2015-20~ <NA> 72029 0 1.39e7
## 2 2020 Data in recent weeks are incomplete. Only~ 13459 0 3.04e6
## 3 2020 Data in recent weeks are incomplete. Only~ 5 0 1.24e2
## 4 2020 Data in recent weeks are incomplete. Only~ 262 0 1.57e5
## 5 2020 Weighted numbers of deaths are 20% or mor~ 280 0 6.00e4
## 6 2020 Weights may be too low to account for und~ 10 0 5.95e3
## 7 2020 <NA> 822 0 1.73e5
## 8 2021 Data in recent weeks are incomplete. Only~ 5631 0 1.34e6
## 9 2021 Data in recent weeks are incomplete. Only~ 24 0 2.00e3
## 10 2021 Data in recent weeks are incomplete. Only~ 358 0 7.15e4
##
## *** File has been checked for uniqueness by: cluster year week
##
## Plots will be run after excluding stateNoCheck states
##
## Detailed cluster summary PDF file is available at: ./RInputFiles/Coronavirus/Plots/CDC_cluster_2021w22.pdf
##
## Returning plot outputs to the main log file
## Joining, by = "state"
##
## Detailed age summary PDF file is available at: ./RInputFiles/Coronavirus/Plots/CDC_age_2021w22.pdf
##
## Returning plot outputs to the main log file
saveToRDS(cdcList_20210708)
The function readProcessCDC() is updated to allow for more control in zeroing out (rather than erroring) where there is a small number of data suppression:
# Function to check for CDC excess suppression
checkCDCSuppression <- function(df, stateNoCheck, errTotAllowed=20, errMaxAllowed=round(errTotAllowed/2)) {
# Categorize the potential issues in the file (note to suppress or NA deaths)
checkProblems <- df %>%
mutate(problem=(!is.na(Suppress) | is.na(deaths)),
noCheck=state %in% all_of(stateNoCheck)
)
# Print a list of the problems, excluding those in stateNoCheck
cat("\nRows in states to be checked that have NA deaths or a note for suppression:\n")
checkProblems %>%
filter(problem, !noCheck) %>%
arrange(desc(year), desc(week)) %>%
select(state, weekEnding, year, week, age, Suppress, deaths) %>%
as.data.frame() %>%
print()
# Summarize the problems
cat("\n\nProblems by state:\n")
checkProblems %>%
group_by(noCheck, state, problem) %>%
summarize(n=n(), deaths=specNA(sum)(deaths), .groups="drop") %>%
filter(problem) %>%
print()
# Assess the amount of error
errorState <- checkProblems %>%
filter(problem, !noCheck) %>%
count(state)
# Error out if threshold for error by state OR total errors exceeded
errMax <- errorState %>% pull(n) %>% max()
errTot <- errorState %>% pull(n) %>% sum()
cat("\n\nThere are", errTot, "rows with errors; maximum for any given state is", errMax, "errors\n")
if ((errTot > errTotAllowed) | (errMax > errMaxAllowed)) {
stop("\nToo many errors; thresholds are ", errTotAllowed, " total and ", errMaxAllowed, " maximum\n")
}
}
plotQCReadProcessCDC <- function(df,
ckCombos=list(c("age"), c("period", "year", "Type"),
c("period", "Suppress"), c("period", "Note")
)
) {
# Create dataset for analysis
df <- df %>%
mutate(n=1, n_deaths_na=ifelse(is.na(deaths), 1, 0))
# Check control totals by specified combinaions
purrr::walk(ckCombos, .f=function(x) {
cat("\n\nChecking variable combination:", x, "\n")
checkControl(df, groupBy=x, useVars=c("n", "n_deaths_na", "deaths"), fn=specNA(sum))
}
)
# Plot deaths by state
p1 <- checkControl(df,
groupBy=c("state"),
useVars=c("deaths"),
fn=specNA(sum),
printControls=FALSE,
pivotData=FALSE
) %>%
ggplot(aes(x=fct_reorder(state, deaths), y=deaths)) +
geom_col(fill="lightblue") +
geom_text(aes(y=deaths, label=paste0(round(deaths/1000), "k")), hjust=0, size=3) +
coord_flip() +
labs(y="Total deaths", x=NULL, title="Total deaths by state in all years in processed file")
print(p1)
# Plot deaths by week/year
p2 <- checkControl(df,
groupBy=c("year", "week"),
useVars=c("deaths"),
fn=specNA(sum),
printControls=FALSE,
pivotData=FALSE
) %>%
ggplot(aes(x=week, y=deaths)) +
geom_line(aes(group=year, color=year)) +
labs(title="Deaths by year and epidemiological week", x="Epi week", y="US deaths") +
scale_color_discrete("Year") +
lims(y=c(0, NA))
print(p2)
}
# Function to read and process raw CDC all-cause deaths data
readProcessCDC <- function(fName,
weekThru,
periodKeep=cdcExcessParams$periodKeep,
fDir="./RInputFiles/Coronavirus/",
col_types=cdcExcessParams$colTypes,
renameVars=cdcExcessParams$remapVars,
maxSuppressAllowed=20,
stateNoCheck=c()
) {
# FUNCTION ARGUMENTS:
# fName: name of the downloaded CDC data file
# weekThru: any record where week is less than or equal to weekThru will be kept
# periodKeep: any record where period is in periodKeep will be kept
# fDir: directory name for the downloaded CDC data file
# col_types: variable type by column in the CDC data (passed to readr::read_csv())
# renameVars: named vector for variable renaming of type c("Existing Name"="New Name")
# maxSuppressAllowed: maximum number of data suppressions (must be in current week/year) to avoid error
# stateNoCheck: vector of states that do NOT have suppression errors thrown
# STEP 1: Read the CSV data
cdcRaw <- fileRead(paste0(fDir, fName), col_types=col_types)
# glimpse(cdcRaw)
# STEP 2: Rename the variables for easier interpretation
cdcRenamed <- cdcRaw %>%
colRenamer(vecRename=renameVars) %>%
colMutater(selfList=list("weekEnding"=lubridate::mdy))
# glimpse(cdcRenamed)
# STEP 3: Convert to factored data
cdcFactored <- cdcRenamed %>%
colMutater(selfList=list("age"=factor), levels=cdcExcessParams$ageLevels) %>%
colMutater(selfList=list("period"=factor), levels=cdcExcessParams$periodLevels) %>%
colMutater(selfList=list("year"=factor), levels=cdcExcessParams$yearLevels)
# glimpse(cdcFactored)
# STEP 4: Filter the data to include only weighted deaths and only through the desired time period
cdcFiltered <- cdcFactored %>%
rowFilter(lstFilter=list("Type"="Predicted (weighted)")) %>%
filter(period %in% all_of(periodKeep) | week <= weekThru)
# glimpse(cdcFiltered)
# STEP 4a: Check that all suppressed data and NA deaths have been eliminated
cat("\n\n *** Data suppression checks *** \n")
checkCDCSuppression(cdcFiltered, stateNoCheck=stateNoCheck, errTotAllowed=maxSuppressAllowed)
cat("\n\nData suppression checks passed\n\n")
# STEP 5: Remove any NA death fields, delete the US record, convert YC to be part of NY
cdcProcessed <- cdcFiltered %>%
rowFilter(lstExclude=list("state"=c("US", "PR"), "deaths"=c(NA))) %>%
mutate(state=ifelse(state=="YC", "NY", state),
fullState=ifelse(state %in% c("NY", "YC"), "New York State (NY plus YC)", fullState)
) %>%
group_by(fullState, weekEnding, state, year, week, age, period, Type, Suppress) %>%
arrange(!is.na(Note)) %>%
summarize(n=n(), deaths=sum(deaths), Note=first(Note), .groups="drop") %>%
ungroup() %>%
checkUniqueRows(uniqueBy=c("state", "year", "week", "age"))
glimpse(cdcProcessed)
# STEP 5a: Check control levels for key variables in processed file
cat("\nCheck Control Levels and Record Counts for Processed Data:\n")
plotQCReadProcessCDC(cdcProcessed)
# STEP 6: Return the processed data file
cdcProcessed
}
The data are processed using the updated function:
cdcLoc <- "Weekly_counts_of_deaths_by_jurisdiction_and_age_group_downloaded_20210708.csv"
cdcList_20210708_v2 <- readRunCDCAllCause(loc=cdcLoc,
weekThru=23,
lst=readFromRDS("cdc_daily_210708"),
stateNoCheck=c("NC"),
pdfCluster=TRUE,
pdfAge=TRUE
)
##
## Parameter cvDeathThru has been set as: 2021-06-12
##
##
## *** Data suppression checks ***
##
## Rows in states to be checked that have NA deaths or a note for suppression:
## state weekEnding year week age
## 1 CT 2021-06-12 2021 23 45-64 years
## 2 CT 2021-06-12 2021 23 65-74 years
## 3 CT 2021-06-12 2021 23 75-84 years
## 4 CT 2021-06-12 2021 23 85 years and older
## 5 DE 2021-06-12 2021 23 65-74 years
## 6 DE 2021-06-12 2021 23 75-84 years
## 7 DE 2021-06-12 2021 23 85 years and older
## 8 WV 2021-06-05 2021 22 45-64 years
## 9 WV 2021-06-05 2021 22 65-74 years
## 10 AK 2021-05-08 2021 18 45-64 years
## 11 AK 2021-05-08 2021 18 65-74 years
## Suppress deaths
## 1 Suppressed (counts highly incomplete, <50% of expected) NA
## 2 Suppressed (counts highly incomplete, <50% of expected) NA
## 3 Suppressed (counts highly incomplete, <50% of expected) NA
## 4 Suppressed (counts highly incomplete, <50% of expected) NA
## 5 Suppressed (counts highly incomplete, <50% of expected) NA
## 6 Suppressed (counts highly incomplete, <50% of expected) NA
## 7 Suppressed (counts highly incomplete, <50% of expected) NA
## 8 Suppressed (counts highly incomplete, <50% of expected) NA
## 9 Suppressed (counts highly incomplete, <50% of expected) NA
## 10 Suppressed (counts highly incomplete, <50% of expected) NA
## 11 Suppressed (counts highly incomplete, <50% of expected) NA
##
##
## Problems by state:
## # A tibble: 5 x 5
## noCheck state problem n deaths
## <lgl> <chr> <lgl> <int> <dbl>
## 1 FALSE AK TRUE 2 NA
## 2 FALSE CT TRUE 4 NA
## 3 FALSE DE TRUE 3 NA
## 4 FALSE WV TRUE 2 NA
## 5 TRUE NC TRUE 114 NA
##
##
## There are 11 rows with errors; maximum for any given state is 4 errors
##
##
## Data suppression checks passed
##
##
## *** File has been checked for uniqueness by: state year week age
##
## Rows: 93,132
## Columns: 12
## $ fullState <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Alabama", "Ala~
## $ weekEnding <date> 2015-01-10, 2015-01-10, 2015-01-10, 2015-01-10, 2015-01-10~
## $ state <chr> "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL",~
## $ year <fct> 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015,~
## $ week <int> 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4,~
## $ age <fct> Under 25 years, 25-44 years, 45-64 years, 65-74 years, 75-8~
## $ period <fct> 2015-2019, 2015-2019, 2015-2019, 2015-2019, 2015-2019, 2015~
## $ Type <chr> "Predicted (weighted)", "Predicted (weighted)", "Predicted ~
## $ Suppress <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,~
## $ n <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,~
## $ deaths <dbl> 25, 67, 253, 202, 272, 320, 28, 49, 256, 222, 253, 332, 26,~
## $ Note <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,~
##
## Check Control Levels and Record Counts for Processed Data:
##
##
## Checking variable combination: age
## # A tibble: 6 x 4
## age n n_deaths_na deaths
## <fct> <dbl> <dbl> <dbl>
## 1 Under 25 years 10919 0 375951
## 2 25-44 years 13908 0 922283
## 3 45-64 years 17084 0 3615594
## 4 65-74 years 17072 0 3626546
## 5 75-84 years 17079 0 4479686
## 6 85 years and older 17070 0 5772387
##
##
## Checking variable combination: period year Type
## # A tibble: 7 x 6
## period year Type n n_deaths_na deaths
## <fct> <fct> <chr> <dbl> <dbl> <dbl>
## 1 2015-2019 2015 Predicted (weighted) 14364 0 2691176
## 2 2015-2019 2016 Predicted (weighted) 14443 0 2723213
## 3 2015-2019 2017 Predicted (weighted) 14408 0 2802027
## 4 2015-2019 2018 Predicted (weighted) 14400 0 2830376
## 5 2015-2019 2019 Predicted (weighted) 14414 0 2844003
## 6 2020 2020 Predicted (weighted) 14838 0 3432903
## 7 2021 2021 Predicted (weighted) 6265 0 1468749
##
##
## Checking variable combination: period Suppress
## # A tibble: 3 x 5
## period Suppress n n_deaths_na deaths
## <fct> <chr> <dbl> <dbl> <dbl>
## 1 2015-2019 <NA> 72029 0 13890795
## 2 2020 <NA> 14838 0 3432903
## 3 2021 <NA> 6265 0 1468749
##
##
## Checking variable combination: period Note
## # A tibble: 10 x 5
## period Note n n_deaths_na deaths
## <fct> <chr> <dbl> <dbl> <dbl>
## 1 2015-20~ <NA> 72029 0 1.39e7
## 2 2020 Data in recent weeks are incomplete. Only~ 13459 0 3.04e6
## 3 2020 Data in recent weeks are incomplete. Only~ 5 0 1.24e2
## 4 2020 Data in recent weeks are incomplete. Only~ 262 0 1.57e5
## 5 2020 Weighted numbers of deaths are 20% or mor~ 280 0 6.00e4
## 6 2020 Weights may be too low to account for und~ 10 0 5.95e3
## 7 2020 <NA> 822 0 1.73e5
## 8 2021 Data in recent weeks are incomplete. Only~ 5822 0 1.38e6
## 9 2021 Data in recent weeks are incomplete. Only~ 34 0 3.23e3
## 10 2021 Data in recent weeks are incomplete. Only~ 409 0 8.16e4
##
## *** File has been checked for uniqueness by: cluster year week
##
## Plots will be run after excluding stateNoCheck states
##
## Detailed cluster summary PDF file is available at: ./RInputFiles/Coronavirus/Plots/CDC_cluster_2021w23.pdf
##
## Returning plot outputs to the main log file
## Joining, by = "state"
##
## Detailed age summary PDF file is available at: ./RInputFiles/Coronavirus/Plots/CDC_age_2021w23.pdf
##
## Returning plot outputs to the main log file
The latest data are downloaded and processed:
cdcLoc <- "Weekly_counts_of_deaths_by_jurisdiction_and_age_group_downloaded_20210823.csv"
cdcList_20210823 <- readRunCDCAllCause(loc=cdcLoc,
weekThru=29,
lst=readFromRDS("cdc_daily_210815"),
stateNoCheck=c("NC", "AK", "CT"),
pdfCluster=TRUE,
pdfAge=TRUE
)
##
## Parameter cvDeathThru has been set as: 2021-07-24
##
##
## *** Data suppression checks ***
##
## Rows in states to be checked that have NA deaths or a note for suppression:
## state weekEnding year week age
## 1 NE 2021-07-24 2021 29 45-64 years
## 2 NE 2021-07-24 2021 29 65-74 years
## 3 NE 2021-07-24 2021 29 75-84 years
## 4 NE 2021-07-24 2021 29 85 years and older
## Suppress deaths
## 1 Suppressed (counts highly incomplete, <50% of expected) NA
## 2 Suppressed (counts highly incomplete, <50% of expected) NA
## 3 Suppressed (counts highly incomplete, <50% of expected) NA
## 4 Suppressed (counts highly incomplete, <50% of expected) NA
##
##
## Problems by state:
## # A tibble: 4 x 5
## noCheck state problem n deaths
## <lgl> <chr> <lgl> <int> <dbl>
## 1 FALSE NE TRUE 4 NA
## 2 TRUE AK TRUE 2 NA
## 3 TRUE CT TRUE 2 NA
## 4 TRUE NC TRUE 120 NA
##
##
## There are 4 rows with errors; maximum for any given state is 4 errors
##
##
## Data suppression checks passed
##
##
## *** File has been checked for uniqueness by: state year week age
##
## Rows: 94,758
## Columns: 12
## $ fullState <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Alabama", "Ala~
## $ weekEnding <date> 2015-01-10, 2015-01-10, 2015-01-10, 2015-01-10, 2015-01-10~
## $ state <chr> "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL",~
## $ year <fct> 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015,~
## $ week <int> 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4,~
## $ age <fct> Under 25 years, 25-44 years, 45-64 years, 65-74 years, 75-8~
## $ period <fct> 2015-2019, 2015-2019, 2015-2019, 2015-2019, 2015-2019, 2015~
## $ Type <chr> "Predicted (weighted)", "Predicted (weighted)", "Predicted ~
## $ Suppress <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,~
## $ n <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,~
## $ deaths <dbl> 25, 67, 253, 202, 272, 320, 28, 49, 256, 222, 253, 332, 26,~
## $ Note <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,~
##
## Check Control Levels and Record Counts for Processed Data:
##
##
## Checking variable combination: age
## # A tibble: 6 x 4
## age n n_deaths_na deaths
## <fct> <dbl> <dbl> <dbl>
## 1 Under 25 years 11107 0 383113
## 2 25-44 years 14165 0 943695
## 3 45-64 years 17377 0 3682738
## 4 65-74 years 17367 0 3696383
## 5 75-84 years 17375 0 4559955
## 6 85 years and older 17367 0 5864442
##
##
## Checking variable combination: period year Type
## # A tibble: 7 x 6
## period year Type n n_deaths_na deaths
## <fct> <fct> <chr> <dbl> <dbl> <dbl>
## 1 2015-2019 2015 Predicted (weighted) 14364 0 2691178
## 2 2015-2019 2016 Predicted (weighted) 14443 0 2723213
## 3 2015-2019 2017 Predicted (weighted) 14406 0 2802009
## 4 2015-2019 2018 Predicted (weighted) 14398 0 2830356
## 5 2015-2019 2019 Predicted (weighted) 14414 0 2844020
## 6 2020 2020 Predicted (weighted) 14835 0 3432937
## 7 2021 2021 Predicted (weighted) 7898 0 1806613
##
##
## Checking variable combination: period Suppress
## # A tibble: 3 x 5
## period Suppress n n_deaths_na deaths
## <fct> <chr> <dbl> <dbl> <dbl>
## 1 2015-2019 <NA> 72025 0 13890776
## 2 2020 <NA> 14835 0 3432937
## 3 2021 <NA> 7898 0 1806613
##
##
## Checking variable combination: period Note
## # A tibble: 10 x 5
## period Note n n_deaths_na deaths
## <fct> <chr> <dbl> <dbl> <dbl>
## 1 2015-20~ <NA> 72025 0 1.39e7
## 2 2020 Data in recent weeks are incomplete. Only~ 13494 0 3.05e6
## 3 2020 Data in recent weeks are incomplete. Only~ 4 0 1.17e2
## 4 2020 Data in recent weeks are incomplete. Only~ 225 0 1.47e5
## 5 2020 Weighted numbers of deaths are 20% or mor~ 280 0 6.00e4
## 6 2020 Weights may be too low to account for und~ 10 0 5.96e3
## 7 2020 <NA> 822 0 1.73e5
## 8 2021 Data in recent weeks are incomplete. Only~ 7250 0 1.63e6
## 9 2021 Data in recent weeks are incomplete. Only~ 18 0 5.3 e2
## 10 2021 Data in recent weeks are incomplete. Only~ 630 0 1.74e5
##
## *** File has been checked for uniqueness by: cluster year week
##
## Plots will be run after excluding stateNoCheck states
##
## Detailed cluster summary PDF file is available at: ./RInputFiles/Coronavirus/Plots/CDC_cluster_2021w29.pdf
##
## Returning plot outputs to the main log file
## Joining, by = "state"
##
## Detailed age summary PDF file is available at: ./RInputFiles/Coronavirus/Plots/CDC_age_2021w29.pdf
##
## Returning plot outputs to the main log file
CDC data for deaths by age and location available at CDC website are downloaded, cached to avoid multiple hits to the server:
deathAgeLoc <- "./RInputFiles/Coronavirus/COvID_deaths_age_place_20210824.csv"
if (!file.exists(deathAgeLoc)) {
fileDownload(fileName="./RInputFiles/Coronavirus/COvID_deaths_age_place_20210824.csv",
url="https://data.cdc.gov/api/views/4va6-ph5s/rows.csv?accessType=DOWNLOAD"
)
} else {
cat("\nFile already exists, not downloading\n")
}
##
## File already exists, not downloading
The file is then read for a basic exploration:
deathAge_20210824_raw <- fileRead(deathAgeLoc, col_types="cccciiccccddddddc")
glimpse(deathAge_20210824_raw)
## Rows: 100,602
## Columns: 17
## $ `Data as of` <chr> "08/18/2021", "08/18/2021",~
## $ `Start Date` <chr> "01/01/2020", "01/01/2020",~
## $ `End Date` <chr> "08/14/2021", "08/14/2021",~
## $ Group <chr> "By Total", "By Total", "By~
## $ Year <int> NA, NA, NA, NA, NA, NA, NA,~
## $ Month <int> NA, NA, NA, NA, NA, NA, NA,~
## $ `HHS Region` <chr> "0", "0", "0", "0", "0", "0~
## $ State <chr> "United States", "United St~
## $ `Place of Death` <chr> "Total - All Places of Deat~
## $ `Age group` <chr> "All Ages", "0-17 years", "~
## $ `COVID-19 Deaths` <dbl> 614530, 361, 2630, 7501, 19~
## $ `Total Deaths` <dbl> 5296490, 53192, 100227, 143~
## $ `Pneumonia Deaths` <dbl> 557008, 865, 2814, 6900, 17~
## $ `Pneumonia and COVID-19 Deaths` <dbl> 303039, 73, 1163, 3498, 986~
## $ `Influenza Deaths` <dbl> 9232, 188, 148, 323, 501, 2~
## $ `Pneumonia, Influenza, or COVID-19 Deaths` <dbl> 876434, 1341, 4417, 11201, ~
## $ Footnote <chr> NA, NA, NA, NA, NA, NA, NA,~
deathAge_20210824_conv <- deathAge_20210824_raw %>%
colRenamer(vecRename=c("Data as of"="asofDate",
"Start Date"="startDate",
"End Date"="endDate",
"HHS Region"="HHSRegion",
"Place of Death"="deathPlace",
"Age group"="Age",
"COVID-19 Deaths"="covidDeaths",
"Total Deaths"="totalDeaths",
"Pneumonia Deaths"="pneumoDeaths",
"Pneumonia and COVID-19 Deaths"="pneumoCovidDeaths",
"Influenza Deaths"="fluDeaths",
"Pneumonia, Influenza, or COVID-19 Deaths"="pnemoFluCovidDeaths"
)
) %>%
colMutater(selfList=list("asofDate"=lubridate::mdy, "startDate"=lubridate::mdy, "endDate"=lubridate::mdy))
glimpse(deathAge_20210824_conv)
## Rows: 100,602
## Columns: 17
## $ asofDate <date> 2021-08-18, 2021-08-18, 2021-08-18, 2021-08-18, 2~
## $ startDate <date> 2020-01-01, 2020-01-01, 2020-01-01, 2020-01-01, 2~
## $ endDate <date> 2021-08-14, 2021-08-14, 2021-08-14, 2021-08-14, 2~
## $ Group <chr> "By Total", "By Total", "By Total", "By Total", "B~
## $ Year <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
## $ Month <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
## $ HHSRegion <chr> "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", ~
## $ State <chr> "United States", "United States", "United States",~
## $ deathPlace <chr> "Total - All Places of Death", "Total - All Places~
## $ Age <chr> "All Ages", "0-17 years", "18-29 years", "30-39 ye~
## $ covidDeaths <dbl> 614530, 361, 2630, 7501, 19776, 98973, 137149, 167~
## $ totalDeaths <dbl> 5296490, 53192, 100227, 143051, 212953, 881095, 10~
## $ pneumoDeaths <dbl> 557008, 865, 2814, 6900, 17026, 92781, 130216, 154~
## $ pneumoCovidDeaths <dbl> 303039, 73, 1163, 3498, 9861, 52942, 74134, 85579,~
## $ fluDeaths <dbl> 9232, 188, 148, 323, 501, 2191, 1997, 2003, 1881, ~
## $ pnemoFluCovidDeaths <dbl> 876434, 1341, 4417, 11201, 27371, 140656, 194900, ~
## $ Footnote <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
# Combinations of startDate and endDate
deathAge_20210824_conv %>%
count(asofDate, startDate, endDate) %>%
ggplot(aes(y=startDate, x=endDate)) +
geom_point(aes(size=n)) +
facet_wrap(~asofDate) +
labs(x="Ending Date", y="Starting Date", title="Combinations of Start and End Date")
deathAge_20210824_conv %>%
count(Group, deathPlace, Age) %>%
ggplot(aes(x=Group, y=deathPlace)) +
geom_tile(aes(fill=n)) +
facet_wrap(~Age) +
labs(x="Group", y="Place of Death", title="Combinations of Age, Place of Death, and Group")
deathState <- deathAge_20210824_conv %>%
filter(Group=="By Total", deathPlace=="Total - All Places of Death", Age=="All Ages") %>%
group_by(State) %>%
summarize(across(where(is.numeric), sum, na.rm=TRUE)) %>%
mutate(abb=state.abb[match(State, state.name)])
deathState %>% filter(is.na(abb))
## # A tibble: 4 x 10
## State Year Month covidDeaths totalDeaths pneumoDeaths pneumoCovidDeat~
## <chr> <int> <int> <dbl> <dbl> <dbl> <dbl>
## 1 District of~ 0 0 1501 11580 1847 1228
## 2 New York Ci~ 0 0 29547 121838 17718 11098
## 3 Puerto Rico 0 0 2567 49898 6883 1823
## 4 United Stat~ 0 0 614530 5296490 557008 303039
## # ... with 3 more variables: fluDeaths <dbl>, pnemoFluCovidDeaths <dbl>,
## # abb <chr>
deathBase <- deathState %>%
select(State, covidDeaths, totalDeaths) %>%
mutate(noncovid=covidDeaths/totalDeaths) %>%
filter(!(State %in% c("United States", "Puerto Rico"))) %>%
pivot_longer(-c(State)) %>%
ggplot(aes(x=fct_reorder(State, value, max), y=value/1000)) +
coord_flip() +
theme(legend.position="bottom")
deathBase +
geom_col(data=~filter(., name=="totalDeaths"), aes(fill="All")) +
geom_col(data=~filter(., name=="covidDeaths"), aes(fill="COVID")) +
scale_fill_manual("Type", breaks=c("COVID", "All"), labels=c("COVID", "All"), values=c("red", "black")) +
labs(title="Deaths 2020-present by state", x=NULL, y="Deaths (000s)")
deathBase +
geom_col(data=~filter(., name=="noncovid"), aes(y=value), position="identity") +
labs(x=NULL, y=NULL, title="Proportion of deaths from COVID")
The data appear to contain monthly totals, with the addition of full-year 2020, YTD 2021, and total 2020-YTD 2021. Totals are provided by age sub-group and overall, place of death category and overall, and monthly, annually, and total.
Total deaths and proportions from COVID appear sensible. Next steps are to continue processing and exploring the data:
# Add the state abbreviation
deathAge_20210824_conv <- deathAge_20210824_conv %>%
mutate(abb=c(state.abb, "DC")[match(State, c(state.name, "District of Columbia"))])
# Function to check that totals match sum of sub-totals
checkSubTotals <- function(df, checkByVars, subVar, subVarTotal, sumVars=NULL, sumFunc=specNA(sum), ...) {
# FUNCTION ARGUMENTS:
# df: data.frame or tibble
# checkByVars: variables that the frame will be checked by
# subVar: variable that is being checked
# subVarTotal: label for the value that is the total of subVar
# sumVars: variables to be summed (NULL means all numeric)
# sumFunc: function to be applied when summing all variables
# ...: any other arguments to pass to summarize(across(all_of(checkByVars), .fns=sumFunc, ...))
# If sumVars is NULL, find the sum variables
if (is.null(sumVars)) sumVars <- df %>% head(1) %>% select_if(is.numeric) %>% names()
# Keep only te desired variables in df
df <- df %>%
select(all_of(c(checkByVars, subVar, sumVars))) %>%
arrange(across(all_of(checkByVars)))
# Split the data frame by subtotal and total
dfTot <- df %>%
filter(get(subVar) == subVarTotal)
dfSub <- df %>%
filter(get(subVar) != subVarTotal) %>%
group_by(across(all_of(checkByVars))) %>%
summarize(across(all_of(sumVars), .fns=sumFunc, ...), .groups="drop") %>%
mutate(fakeCol=subVarTotal) %>%
colRenamer(vecRename=c("fakeCol"=subVar)) %>%
select(names(dfTot))
# Comparison of totals
list(dfSub=dfSub, dfTot=dfTot)
}
checkNumbers <- function(lst, byVars, lstNames=NULL, absTol=100, pctTol=0.05, keyVar="key variable") {
# FUNCTION ARGUMENTS:
# lst: a list with two items that will be checked for similarity
# byVars: by variables that should be identical across the list items
# lstNames: names to use for the list (NULL means use names provided in lst)
# absTol: absolute value of differences to flag
# pctTol: percent tolerance for differences to flag
# keyVar: name for the key variable in plot title
# Check that lst is a list of length 2
if (!("list" %in% class(lst)) | !(length(lst)==2)) stop("\nMust pass a list with two items\n")
# Add names if passed in lstNames, otherwise use names(lst)
if (!is.null(lstNames)) names(lst) <- lstNames
else lstNames <- names(lst)
# Check for identical files using only byVars
if (!isTRUE(identical(lst[[1]][, byVars], lst[[2]][, byVars])))
stop("\nSub-lists differ by byVars, not comparing\n")
else cat("\nSub-lists are identical by:", paste0(byVars, collapse=", "), "\n")
# Check the numeric values
dfDelta <- lapply(lst, FUN=function(x) pivot_longer(x, cols=-all_of(byVars)) %>%
mutate(value=ifelse(is.na(value), 0, value)) %>%
select(all_of(byVars), name, value)
) %>%
purrr::reduce(.f=inner_join, by=c(all_of(byVars), "name")) %>%
mutate(delta=value.x-value.y, pct=ifelse(delta==0, 0, delta/(value.x+value.y))) %>%
purrr::set_names(c(all_of(byVars), "name", all_of(lstNames), "delta", "pct"))
# Plot the differences using name as facet
p1 <- dfDelta %>%
ggplot(aes(x=delta, y=pct)) +
geom_point() +
facet_wrap(~name, scales="free") +
labs(title=paste0("Differences between totals and subtotals on variable: ", keyVar),
x="Difference between total and subtotal",
y="Percentage difference"
)
print(p1)
# Flag significant outliers
dfDelta %>%
filter(abs(delta) >= absTol, abs(pct) >= pctTol) %>%
arrange(-abs(delta)) %>%
print()
}
# Get a list of the possible variables
allCheckVars <- names(deathAge_20210824_conv) %>%
setdiff(deathAge_20210824_conv %>% head(1) %>% select_if(is.numeric) %>% names()) %>%
setdiff(c("Footnote", "abb", "HHSRegion"))
# Test for each variable in allCheckVars
subMap <- c("State"="United States", "Age"="All Ages", "deathPlace"="Total - All Places of Death")
lapply(c("State", "deathPlace", "Age"),
FUN=function(x) deathAge_20210824_conv %>%
select(-Year, -Month) %>%
checkSubTotals(checkByVars=allCheckVars %>% setdiff(x), subVar=x, subVarTotal=unname(subMap[x])) %>%
checkNumbers(byVars=allCheckVars, keyVar=x)
)
##
## Sub-lists are identical by: asofDate, startDate, endDate, Group, State, deathPlace, Age
## # A tibble: 1,118 x 12
## asofDate startDate endDate Group State deathPlace Age name dfSub
## <date> <date> <date> <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 2021-08-18 2020-10-01 2020-10-31 By Mo~ Unite~ Total - Al~ 30-3~ pnemo~ 205
## 2 2021-08-18 2020-02-01 2020-02-29 By Mo~ Unite~ Total - Al~ 30-3~ pnemo~ 71
## 3 2021-08-18 2020-11-01 2020-11-30 By Mo~ Unite~ Total - Al~ 30-3~ pneum~ 227
## 4 2021-08-18 2020-08-01 2020-08-31 By Mo~ Unite~ Other 0-17~ total~ 116
## 5 2021-08-18 2020-09-01 2020-09-30 By Mo~ Unite~ Decedent's~ 50-6~ pnemo~ 189
## 6 2021-08-18 2020-01-01 2020-01-31 By Mo~ Unite~ Total - Al~ 30-3~ pnemo~ 183
## 7 2021-08-18 2020-02-01 2020-02-29 By Mo~ Unite~ Healthcare~ 65-7~ fluDe~ 204
## 8 2021-08-18 2020-02-01 2020-02-29 By Mo~ Unite~ Total - Al~ 65-7~ fluDe~ 317
## 9 2021-08-18 2020-01-01 2020-01-31 By Mo~ Unite~ Total - Al~ 30-3~ pneum~ 73
## 10 2021-08-18 2020-10-01 2020-10-31 By Mo~ Unite~ Decedent's~ 65-7~ pneum~ 143
## # ... with 1,108 more rows, and 3 more variables: dfTot <dbl>, delta <dbl>,
## # pct <dbl>
##
## Sub-lists are identical by: asofDate, startDate, endDate, Group, State, deathPlace, Age
## # A tibble: 0 x 12
## # ... with 12 variables: asofDate <date>, startDate <date>, endDate <date>,
## # Group <chr>, State <chr>, deathPlace <chr>, Age <chr>, name <chr>,
## # dfSub <dbl>, dfTot <dbl>, delta <dbl>, pct <dbl>
##
## Sub-lists are identical by: asofDate, startDate, endDate, Group, State, deathPlace, Age
## # A tibble: 0 x 12
## # ... with 12 variables: asofDate <date>, startDate <date>, endDate <date>,
## # Group <chr>, State <chr>, deathPlace <chr>, Age <chr>, name <chr>,
## # dfSub <dbl>, dfTot <dbl>, delta <dbl>, pct <dbl>
## [[1]]
## # A tibble: 1,118 x 12
## asofDate startDate endDate Group State deathPlace Age name dfSub
## <date> <date> <date> <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 2021-08-18 2020-10-01 2020-10-31 By Mo~ Unite~ Total - Al~ 30-3~ pnemo~ 205
## 2 2021-08-18 2020-02-01 2020-02-29 By Mo~ Unite~ Total - Al~ 30-3~ pnemo~ 71
## 3 2021-08-18 2020-11-01 2020-11-30 By Mo~ Unite~ Total - Al~ 30-3~ pneum~ 227
## 4 2021-08-18 2020-08-01 2020-08-31 By Mo~ Unite~ Other 0-17~ total~ 116
## 5 2021-08-18 2020-09-01 2020-09-30 By Mo~ Unite~ Decedent's~ 50-6~ pnemo~ 189
## 6 2021-08-18 2020-01-01 2020-01-31 By Mo~ Unite~ Total - Al~ 30-3~ pnemo~ 183
## 7 2021-08-18 2020-02-01 2020-02-29 By Mo~ Unite~ Healthcare~ 65-7~ fluDe~ 204
## 8 2021-08-18 2020-02-01 2020-02-29 By Mo~ Unite~ Total - Al~ 65-7~ fluDe~ 317
## 9 2021-08-18 2020-01-01 2020-01-31 By Mo~ Unite~ Total - Al~ 30-3~ pneum~ 73
## 10 2021-08-18 2020-10-01 2020-10-31 By Mo~ Unite~ Decedent's~ 65-7~ pneum~ 143
## # ... with 1,108 more rows, and 3 more variables: dfTot <dbl>, delta <dbl>,
## # pct <dbl>
##
## [[2]]
## # A tibble: 0 x 12
## # ... with 12 variables: asofDate <date>, startDate <date>, endDate <date>,
## # Group <chr>, State <chr>, deathPlace <chr>, Age <chr>, name <chr>,
## # dfSub <dbl>, dfTot <dbl>, delta <dbl>, pct <dbl>
##
## [[3]]
## # A tibble: 0 x 12
## # ... with 12 variables: asofDate <date>, startDate <date>, endDate <date>,
## # Group <chr>, State <chr>, deathPlace <chr>, Age <chr>, name <chr>,
## # dfSub <dbl>, dfTot <dbl>, delta <dbl>, pct <dbl>
Variables Age and deathPlace appear to be well-aligned between sub-totals and totals, while variable State shows some more significant differences. Next steps are to further research what is contained in State, including alignment to other data sources.
Deaths by state are compared between files, using July 31, 2021 as the cutoff:
# Create summary by state and year-month
death_sum_210824 <- deathAge_20210824_conv %>%
filter(!is.na(Year), !is.na(Month), deathPlace=="Total - All Places of Death", Age=="All Ages") %>%
mutate(ym=lubridate::ym(paste0(Year, "-", zeroPad2(Month))),
abb=c(state.abb, "DC", "US")[match(State, c(state.name, "District of Columbia", "United States"))]
) %>%
select(State, abb, ym, where(is.numeric), -Year, -Month) %>%
pivot_longer(-c(State, abb, ym)) %>%
arrange(State, abb, name, ym) %>%
group_by(State, abb, name) %>%
mutate(cumValue=cumsum(ifelse(is.na(value), 0, value))) %>%
ungroup() %>%
mutate(date=lubridate::ceiling_date(ym, unit="month")-lubridate::days(1))
# Create summary from state-level file
death_daily_210815 <- readFromRDS("cdc_daily_210815")$dfPerCapita %>%
select(date, abb=state, tot_deaths) %>%
mutate(Year=lubridate::year(date), Month=lubridate::month(date)) %>%
group_by(Year, Month) %>%
filter(date==max(date)) %>%
ungroup()
# Create a plot for evolution of United States
death_sum_210824 %>%
filter(abb=="US", name=="covidDeaths", ym <= "2021-07-31") %>%
ggplot(aes(x=date)) +
geom_line(aes(y=cumValue/1000, color="blue"), size=2) +
geom_point(data=summarize(group_by(filter(death_daily_210815, date <= "2021-07-31"), date),
tot_deaths=sum(tot_deaths, na.rm=TRUE)
),
aes(y=tot_deaths/1000, color="green"),
size=3
) +
labs(x="End of month", y="Cumulative Deaths (000)", title="Cumulative COVID Deaths (000) in US by source") +
scale_color_manual("Source", labels=c("Summed\nstates", "Summed\nsubtotals"), values=c("green", "blue"))
Cumulative deaths by month for total US appear consistent across the files. Next steps are to continue exploring for state-level data:
# Create a plot for total by states
death_sum_210824 %>%
filter(abb %in% c(state.abb, "DC"), name=="covidDeaths", date == "2021-07-31") %>%
ggplot() +
geom_col(aes(x=fct_reorder(abb, cumValue), y=cumValue/1000), fill="lightblue") +
geom_point(data=filter(death_daily_210815, date == "2021-07-31"),
aes(x=abb, y=tot_deaths/1000),
size=3
) +
coord_flip() +
labs(x=NULL,
y="Cumulative Deaths (000)",
title="Cumulative COVID Deaths (000) in US as of 2021-07-31",
subtitle="Filled bars are summed subtotals, points are from CDC daily")
# Same plot using merged data
plot_cum0721 <- death_sum_210824 %>%
filter(abb %in% c(state.abb, "DC"), name=="covidDeaths", date == "2021-07-31") %>%
select(abb, cumValue) %>%
inner_join(select(filter(death_daily_210815, date == "2021-07-31"), abb, tot_deaths), by=c("abb")) %>%
mutate(pctdiff=abs(tot_deaths-cumValue)/(tot_deaths+cumValue))
plot_cum0721 %>%
arrange(-pctdiff)
## # A tibble: 51 x 4
## abb cumValue tot_deaths pctdiff
## <chr> <dbl> <dbl> <dbl>
## 1 NY 25579 53524 0.353
## 2 MA 13713 18082 0.137
## 3 DC 1500 1149 0.133
## 4 NE 2963 2280 0.130
## 5 MO 12003 9667 0.108
## 6 GA 18335 21683 0.0837
## 7 OK 8845 7515 0.0813
## 8 AK 327 382 0.0776
## 9 WY 672 776 0.0718
## 10 ND 1766 1539 0.0687
## # ... with 41 more rows
plot_cum0721 %>%
summarize(across(where(is.numeric), sum))
## # A tibble: 1 x 3
## cumValue tot_deaths pctdiff
## <dbl> <dbl> <dbl>
## 1 581194 609079 2.27
plot_cum0721 %>%
ggplot(aes(x=fct_reorder(abb, cumValue))) +
geom_col(aes(y=cumValue/1000), fill="lightblue") +
geom_point(aes(y=tot_deaths/1000), size=3) +
coord_flip() +
labs(x=NULL,
y="Cumulative Deaths (000)",
title="Cumulative COVID Deaths (000) in US as of 2021-07-31",
subtitle="Filled bars are summed subtotals, points are from CDC daily"
)
The New York City data will need to be added to NY for further analysis. There are some surprising differences in total deaths reported by state, even as total deaths (after adding Nyc) are nearly identical between the files.
Breakdown of deaths by age is also explored:
deathAllData <- deathAge_20210824_conv %>%
filter(deathPlace=="Total - All Places of Death")
deathAllData
## # A tibble: 11,178 x 18
## asofDate startDate endDate Group Year Month HHSRegion State deathPlace
## <date> <date> <date> <chr> <int> <int> <chr> <chr> <chr>
## 1 2021-08-18 2020-01-01 2021-08-14 By T~ NA NA 0 Unit~ Total - A~
## 2 2021-08-18 2020-01-01 2021-08-14 By T~ NA NA 0 Unit~ Total - A~
## 3 2021-08-18 2020-01-01 2021-08-14 By T~ NA NA 0 Unit~ Total - A~
## 4 2021-08-18 2020-01-01 2021-08-14 By T~ NA NA 0 Unit~ Total - A~
## 5 2021-08-18 2020-01-01 2021-08-14 By T~ NA NA 0 Unit~ Total - A~
## 6 2021-08-18 2020-01-01 2021-08-14 By T~ NA NA 0 Unit~ Total - A~
## 7 2021-08-18 2020-01-01 2021-08-14 By T~ NA NA 0 Unit~ Total - A~
## 8 2021-08-18 2020-01-01 2021-08-14 By T~ NA NA 0 Unit~ Total - A~
## 9 2021-08-18 2020-01-01 2021-08-14 By T~ NA NA 0 Unit~ Total - A~
## 10 2021-08-18 2020-01-01 2021-08-14 By T~ NA NA 4 Alab~ Total - A~
## # ... with 11,168 more rows, and 9 more variables: Age <chr>,
## # covidDeaths <dbl>, totalDeaths <dbl>, pneumoDeaths <dbl>,
## # pneumoCovidDeaths <dbl>, fluDeaths <dbl>, pnemoFluCovidDeaths <dbl>,
## # Footnote <chr>, abb <chr>
# Proportions of death by age and cause
deathAllData %>%
filter(State=="United States", Age != "All Ages", Group=="By Total") %>%
select(Age, where(is.numeric), -Year, -Month) %>%
pivot_longer(-Age) %>%
ggplot() +
geom_col(aes(x=name, y=value, fill=fct_rev(Age)), position="fill") +
labs(x=NULL, y="Proportion of Deaths", title="Proportion of deaths by cause (2020-August 2021)") +
scale_fill_discrete("Age")
# Proportions of death by age and month
deathAllData %>%
filter(State=="United States", Age != "All Ages", Group=="By Month") %>%
mutate(ym=lubridate::ym(paste0(Year, "-", zeroPad2(Month)))) %>%
select(Age, ym, totalDeaths, covidDeaths, fluDeaths) %>%
pivot_longer(-c(Age, ym)) %>%
ggplot() +
geom_col(aes(x=ym, y=value, fill=fct_rev(Age)), position="fill") +
facet_wrap(~name) +
labs(x=NULL, y="Proportion of Deaths", title="Proportion of deaths by age and cause (2020-August 2021)") +
scale_fill_discrete("Age")
# Total death by age and month
deathAllData %>%
filter(State=="United States", Age != "All Ages", Group=="By Month") %>%
mutate(ym=lubridate::ym(paste0(Year, "-", zeroPad2(Month)))) %>%
select(Age, ym, totalDeaths, covidDeaths, fluDeaths) %>%
pivot_longer(-c(Age, ym)) %>%
filter(ym != "2021-08-01") %>%
ggplot() +
geom_line(aes(x=ym, y=value, color=fct_rev(Age), group=Age)) +
facet_wrap(~name, scales="free_y") +
labs(x=NULL, y="Proportion of Deaths", title="Deaths by age and cause (2020-July 2021)") +
scale_color_discrete("Age")
There are very few reported flu deaths in the 2020-2021 data. The change in covidDeaths by age over time appears to be at most a minor driver of the change in totalDeaths by age over time. This is consistent with covidDeaths being in the 10%-20% range of totalDeaths, distributed by age (to a first order) in a somewhat similar pattern.
A similar process is run for place of death:
deathPlaceData <- deathAge_20210824_conv %>%
filter(Age == "All Ages")
deathPlaceData
## # A tibble: 11,178 x 18
## asofDate startDate endDate Group Year Month HHSRegion State deathPlace
## <date> <date> <date> <chr> <int> <int> <chr> <chr> <chr>
## 1 2021-08-18 2020-01-01 2021-08-14 By T~ NA NA 0 Unit~ Total - A~
## 2 2021-08-18 2020-01-01 2021-08-14 By T~ NA NA 0 Unit~ Healthcar~
## 3 2021-08-18 2020-01-01 2021-08-14 By T~ NA NA 0 Unit~ Healthcar~
## 4 2021-08-18 2020-01-01 2021-08-14 By T~ NA NA 0 Unit~ Healthcar~
## 5 2021-08-18 2020-01-01 2021-08-14 By T~ NA NA 0 Unit~ Decedent'~
## 6 2021-08-18 2020-01-01 2021-08-14 By T~ NA NA 0 Unit~ Hospice f~
## 7 2021-08-18 2020-01-01 2021-08-14 By T~ NA NA 0 Unit~ Nursing h~
## 8 2021-08-18 2020-01-01 2021-08-14 By T~ NA NA 0 Unit~ Other
## 9 2021-08-18 2020-01-01 2021-08-14 By T~ NA NA 0 Unit~ Place of ~
## 10 2021-08-18 2020-01-01 2021-08-14 By T~ NA NA 4 Alab~ Total - A~
## # ... with 11,168 more rows, and 9 more variables: Age <chr>,
## # covidDeaths <dbl>, totalDeaths <dbl>, pneumoDeaths <dbl>,
## # pneumoCovidDeaths <dbl>, fluDeaths <dbl>, pnemoFluCovidDeaths <dbl>,
## # Footnote <chr>, abb <chr>
# Proportions of death by place and cause
deathPlaceData %>%
filter(State=="United States", deathPlace!="Total - All Places of Death", Group=="By Total") %>%
select(deathPlace, where(is.numeric), -Year, -Month) %>%
pivot_longer(-deathPlace) %>%
ggplot() +
coord_flip() +
geom_col(aes(x=name, y=value, fill=fct_rev(deathPlace)), position="fill") +
labs(x=NULL, y="Proportion of Deaths", title="Proportion of deaths by place (2020-August 2021)") +
scale_fill_discrete("Death\nPlace") +
theme(legend.position="bottom")
# Proportions of death by place and month
deathPlaceData %>%
filter(State=="United States", deathPlace!="Total - All Places of Death", Group=="By Month") %>%
mutate(ym=lubridate::ym(paste0(Year, "-", zeroPad2(Month)))) %>%
select(deathPlace, ym, totalDeaths, covidDeaths, fluDeaths) %>%
pivot_longer(-c(deathPlace, ym)) %>%
ggplot() +
geom_col(aes(x=ym, y=value, fill=fct_rev(deathPlace)), position="fill") +
facet_wrap(~name) +
labs(x=NULL, y="Proportion of Deaths", title="Proportion of deaths by place and cause (2020-August 2021)") +
scale_fill_discrete("Death\nPlace") +
theme(legend.position="bottom")
# Total death by place and month
deathPlaceData %>%
filter(State=="United States", deathPlace!="Total - All Places of Death", Group=="By Month") %>%
mutate(ym=lubridate::ym(paste0(Year, "-", zeroPad2(Month)))) %>%
select(deathPlace, ym, totalDeaths, covidDeaths, fluDeaths) %>%
pivot_longer(-c(deathPlace, ym)) %>%
filter(ym != "2021-08-01") %>%
ggplot() +
geom_line(aes(x=ym, y=value, color=fct_rev(deathPlace), group=deathPlace)) +
facet_wrap(~name, scales="free_y") +
labs(x=NULL, y="Proportion of Deaths", title="Deaths by place and cause (2020-July 2021)") +
scale_color_discrete("Death\nPlace")
Relative to overall deaths, COVID deaths appear more prevalent in the inpatient healthcare setting or nursing home and less prevalent at home. The proportion has moved away from nursing homes and towards inpatient (hospital) as the pandemic progressed.
Exploration of the place of death for COVID and non-COVID deaths is explored:
zeroNA <- function(x) ifelse(is.na(x), 0, x)
# Locations of death by age
tempPlotData <- deathAge_20210824_conv %>%
mutate(nonCovidDeaths=zeroNA(totalDeaths)-zeroNA(covidDeaths)) %>%
select(Group, startDate, endDate, State, deathPlace, Age, where(is.numeric), -Month, -Year) %>%
pivot_longer(where(is.numeric))
# Basic plotting data
p1 <- tempPlotData %>%
filter(name %in% c("covidDeaths", "nonCovidDeaths"),
State=="United States",
Group=="By Total"
) %>%
ggplot(aes(x=Age, y=value/1000)) +
coord_flip() +
scale_fill_discrete("") +
theme(legend.position="bottom") +
labs(x=NULL, y="Deaths (000)", title="United States deaths (2020 thru mid-Aug 2021)")
# Overall deaths by age and type
p1 +
geom_col(data=~filter(., deathPlace=="Total - All Places of Death", Age !="All Ages"),
aes(fill=name),
position="stack"
)
# Proportion deaths by age and type
p1 +
geom_col(data=~filter(., deathPlace=="Total - All Places of Death"),
aes(fill=fct_rev(name)),
position="fill"
) +
labs(y="Proportion of deaths")
# Overall deaths by age and type and location
p1 +
geom_col(data=~filter(., deathPlace!="Total - All Places of Death", Age != "All Ages"),
aes(fill=name),
position="stack"
) +
facet_wrap(~deathPlace)
# Proportion of deaths by age and type and location
p1 +
geom_col(data=~filter(., Age !="All Ages"),
aes(fill=fct_rev(name)),
position="fill"
) +
facet_wrap(~deathPlace) +
labs(y="Proportion of deaths") +
geom_hline(yintercept=0.25, lty=2)
As seen in other analyses, COVID deaths tend to occur in an older population in the institutional (nursing home or hospital) setting. Further exploration of these trends over time and by location may be interesting.
The evolution by month is also explored:
# Basic plotting data
p2 <- tempPlotData %>%
filter(name %in% c("covidDeaths", "nonCovidDeaths"),
State=="United States",
Group=="By Month",
endDate <= "2021-07-31"
) %>%
ggplot(aes(x=fct_reorder(deathPlace, value, max), y=value/1000)) +
coord_flip() +
scale_fill_discrete("") +
theme(legend.position="bottom") +
labs(x=NULL, y="Deaths (000)", title="United States deaths (2020 thru July 2021)")
# Overall deaths by month and place
p2 +
geom_col(data=~filter(., deathPlace!="Total - All Places of Death", Age =="All Ages"),
aes(fill=name),
position="stack"
) +
facet_wrap(~endDate)
# Proportion of deaths by month and place
p2 +
geom_col(data=~filter(., deathPlace!="Total - All Places of Death", Age =="All Ages"),
aes(fill=name),
position="fill"
) +
facet_wrap(~endDate) +
labs(y="Proportion of deaths")
# Overall deaths by month and age
p2 +
geom_col(data=~filter(., deathPlace=="Total - All Places of Death", Age !="All Ages"),
aes(x=Age, fill=name),
position="stack"
) +
facet_wrap(~endDate)
# Proportion of deaths by month and age
p2 +
geom_col(data=~filter(., deathPlace=="Total - All Places of Death", Age !="All Ages"),
aes(x=Age, fill=name),
position="fill"
) +
facet_wrap(~endDate) +
labs(y="Proportion of deaths")
COVID deaths are clearly clustered by each of age, month, and place. Older people, late-2020 and early-2021, and inpatient hospital settings have much higher amounts and proportions of covid death
The latest data are downloaded and processed:
cdcLoc <- "Weekly_counts_of_deaths_by_jurisdiction_and_age_group_downloaded_20210911.csv"
cdcList_20210911 <- readRunCDCAllCause(loc=cdcLoc,
weekThru=32,
lst=readFromRDS("cdc_daily_210902"),
stateNoCheck=c("NC", "WV", "CT"),
pdfCluster=TRUE,
pdfAge=TRUE
)
##
## Parameter cvDeathThru has been set as: 2021-08-14
##
##
## *** Data suppression checks ***
##
## Rows in states to be checked that have NA deaths or a note for suppression:
## [1] state weekEnding year week age Suppress deaths
## <0 rows> (or 0-length row.names)
##
##
## Problems by state:
## # A tibble: 3 x 5
## noCheck state problem n deaths
## <lgl> <chr> <lgl> <int> <dbl>
## 1 TRUE CT TRUE 8 NA
## 2 TRUE NC TRUE 90 NA
## 3 TRUE WV TRUE 7 NA
## Warning in max(.): no non-missing arguments to max; returning -Inf
##
##
## There are 0 rows with errors; maximum for any given state is -Inf errors
##
##
## Data suppression checks passed
##
##
## *** File has been checked for uniqueness by: state year week age
##
## Rows: 95,671
## Columns: 12
## $ fullState <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Alabama", "Ala~
## $ weekEnding <date> 2015-01-10, 2015-01-10, 2015-01-10, 2015-01-10, 2015-01-10~
## $ state <chr> "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL",~
## $ year <fct> 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015,~
## $ week <int> 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4,~
## $ age <fct> Under 25 years, 25-44 years, 45-64 years, 65-74 years, 75-8~
## $ period <fct> 2015-2019, 2015-2019, 2015-2019, 2015-2019, 2015-2019, 2015~
## $ Type <chr> "Predicted (weighted)", "Predicted (weighted)", "Predicted ~
## $ Suppress <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,~
## $ n <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,~
## $ deaths <dbl> 25, 67, 253, 202, 272, 320, 28, 49, 256, 222, 253, 332, 26,~
## $ Note <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,~
##
## Check Control Levels and Record Counts for Processed Data:
##
##
## Checking variable combination: age
## # A tibble: 6 x 4
## age n n_deaths_na deaths
## <fct> <dbl> <dbl> <dbl>
## 1 Under 25 years 11223 0 387116
## 2 25-44 years 14307 0 957029
## 3 45-64 years 17542 0 3723324
## 4 65-74 years 17532 0 3737178
## 5 75-84 years 17538 0 4605961
## 6 85 years and older 17529 0 5916037
##
##
## Checking variable combination: period year Type
## # A tibble: 7 x 6
## period year Type n n_deaths_na deaths
## <fct> <fct> <chr> <dbl> <dbl> <dbl>
## 1 2015-2019 2015 Predicted (weighted) 14363 0 2691167
## 2 2015-2019 2016 Predicted (weighted) 14443 0 2723213
## 3 2015-2019 2017 Predicted (weighted) 14406 0 2802007
## 4 2015-2019 2018 Predicted (weighted) 14399 0 2830363
## 5 2015-2019 2019 Predicted (weighted) 14416 0 2844031
## 6 2020 2020 Predicted (weighted) 14836 0 3432959
## 7 2021 2021 Predicted (weighted) 8808 0 2002905
##
##
## Checking variable combination: period Suppress
## # A tibble: 3 x 5
## period Suppress n n_deaths_na deaths
## <fct> <chr> <dbl> <dbl> <dbl>
## 1 2015-2019 <NA> 72027 0 13890781
## 2 2020 <NA> 14836 0 3432959
## 3 2021 <NA> 8808 0 2002905
##
##
## Checking variable combination: period Note
## # A tibble: 10 x 5
## period Note n n_deaths_na deaths
## <fct> <chr> <dbl> <dbl> <dbl>
## 1 2015-20~ <NA> 72027 0 1.39e7
## 2 2020 Data in recent weeks are incomplete. Only~ 13486 0 3.04e6
## 3 2020 Data in recent weeks are incomplete. Only~ 5 0 1.24e2
## 4 2020 Data in recent weeks are incomplete. Only~ 233 0 1.50e5
## 5 2020 Weighted numbers of deaths are 20% or mor~ 280 0 6.00e4
## 6 2020 Weights may be too low to account for und~ 12 0 8.31e3
## 7 2020 <NA> 820 0 1.70e5
## 8 2021 Data in recent weeks are incomplete. Only~ 8072 0 1.75e6
## 9 2021 Data in recent weeks are incomplete. Only~ 69 0 8.28e3
## 10 2021 Data in recent weeks are incomplete. Only~ 667 0 2.47e5
##
## *** File has been checked for uniqueness by: cluster year week
##
## Plots will be run after excluding stateNoCheck states
##
## Detailed cluster summary PDF file is available at: ./RInputFiles/Coronavirus/Plots/CDC_cluster_2021w32.pdf
##
## Returning plot outputs to the main log file
## Joining, by = "state"
##
## Detailed age summary PDF file is available at: ./RInputFiles/Coronavirus/Plots/CDC_age_2021w32.pdf
##
## Returning plot outputs to the main log file
saveToRDS(cdcList_20210911)
Plots are created for the deaths by week and age group, with the capacity to limit to a subset:
# Function to get the regression from a plot
tempGetReg <- function(df, regYears=2015:2019) {
lm(deaths ~ age + weekEnding:age + 0, data=subset(df, year %in% all_of(regYears))) %>%
broom::tidy() %>%
mutate(age=factor(str_remove(str_remove(term, pattern=":.*"), pattern="age")),
type=ifelse(str_detect(term, pattern=":"), "slope", "intercept")
) %>%
select(age, name=type, value=estimate) %>%
pivot_wider(age)
}
# Function to plot differences from trend
getDiffTrend <- function(df,
regYears,
plotTitle,
returnData=FALSE
) {
# FUNCTION ARGUMENTS:
# df: data frame or tibble with filtered data
# regYears: years to be used in regression
# plotTitle: title for the plot
# returnData: boolean, should the data be returned?
# Run the linear mode and make the predictions
tempLM <- lm(deaths ~ age + weekEnding:age + 0, data=subset(df, year %in% all_of(regYears)))
tempDF <- df %>%
mutate(pred=predict(tempLM, newdata=df))
# Summarize to year-age
sumDF <- tempDF %>%
group_by(year, age) %>%
summarize(across(where(is.numeric), sum), .groups="drop")
# Plot the excess by year and age group
p2 <- sumDF %>%
ggplot(aes(x=year, y=deaths-pred)) +
geom_col(aes(fill=year)) +
geom_text(data=~filter(., !(year %in% all_of(regYears))),
aes(y=0.5*(deaths-pred), label=round(deaths-pred))
) +
facet_wrap(~age, scales="free_y") +
labs(x="Year",
y="Actual deaths vs. predicted",
title=plotTitle,
subtitle=paste0("Trend line is linear model without seasonailty using ",
paste0(regYears, collapse="-"),
" data"
)
)
# Return the requested data
if(returnData) list(objPlot=p2, fullDF=tempDF, sumDF=sumDF)
else p2
}
# Function to create plots
plotAgeWeekDeath <- function(lst,
keyStates=NULL,
addLM=TRUE,
lmYears=2015:2019,
diffTrend=FALSE,
printPlots=TRUE,
returnData=FALSE,
returnPlots=!isTRUE(printPlots)
) {
# FUNCTION ARGUMENTS:
# lst: a processed list file that includes the CDC deaths data
# keyStates: states to be included
# addLM: boolean, should a line for the linear model be added?
# lmYears: what years should the linear model be fitted against?
# diffTrend: boolean, should a difference from trend by year be calculated?
# printPlots: boolean, should the plots be printed?
# returnData: boolean, should the data be returned?
# returnPlots: boolean, should the plot objects be returned?
# Create the plot title (use 50 states plus DC if keyStates passed as NULL)
plotTitle <- paste0("Total deaths by age cohort and week (",
if (is.null(keyStates)) "50 states plus DC" else paste0(keyStates, collapse=", "),
")"
)
# Set the keyStates field to 50 states plus DC if not provided
if (is.null(keyStates)) keyStates <- c(state.abb, "DC")
# Extract the CDC data
df <- lst[["cdc"]]
# Create plot data
p1Data <- df %>%
filter(state %in% all_of(keyStates)) %>%
group_by(weekEnding, year, age) %>%
summarize(deaths=sum(deaths), .groups="drop")
# Create the plot
p1 <- p1Data %>%
ggplot(aes(x=weekEnding, y=deaths)) +
geom_line(aes(color=year)) +
facet_wrap(~age, scales="free_y") +
lims(y=c(0, NA)) +
labs(title=plotTitle,
x="Week",
y="All-cause deaths"
) +
# subtitle="Dashed line is linear model without seasonailty using 2015-2019 data",
scale_color_discrete("Year")
# Add the linear model if requested
if (isTRUE(addLM)) {
p1 <- p1 +
geom_abline(data=~tempGetReg(., regYears=lmYears),
aes(slope=slope, intercept=intercept),
lty=2
) +
labs(subtitle=paste0("Dashed line is linear model without seasonailty using ",
paste0(lmYears, collapse="-"),
" data"
)
)
}
# Display the plot if requested
if (printPlots) print(p1)
# Create a dataset for difference from expected deaths by year
if (isTRUE(diffTrend)) {
p2 <- getDiffTrend(p1Data,
regYears=lmYears,
plotTitle=paste0("Deaths vs trend for: ",
if (setequal(keyStates, c(state.abb, "DC")))
"50 states plus DC"
else
paste0(keyStates, collapse=", ")
),
returnData=returnData
)
p2Plot <- if(isTRUE(returnData)) p2$objPlot else p2
if (printPlots) print(p2Plot)
} else {
p2 <- NULL
}
# Return objects if requested
if (isTRUE(returnData) | isTRUE(returnPlots)) {
return(list(p1=if(isTRUE(returnPlots)) p1 else NULL,
p2=if(isTRUE(returnPlots)) p2Plot else NULL,
p1Data=if(isTRUE(returnData)) p1Data else NULL,
p2Full=if(isTRUE(returnData)) p2$fullDF else NULL,
p2Sum=if(isTRUE(returnData)) p2$sumDF else NULL
)
)
}
}
# Example plots
plotAgeWeekDeath(cdcList_20210911, addLM=FALSE)
plotAgeWeekDeath(cdcList_20210911)
plotAgeWeekDeath(cdcList_20210911, keyStates=c("NY", "NJ", "MA", "CT"))
plotAgeWeekDeath(cdcList_20210911, keyStates=c("TX", "LA", "MS", "AL", "FL"))
plotAgeWeekDeath(cdcList_20210911, keyStates=c("ND", "SD", "MT", "IA", "MO"))
# Examples with excess
plotAgeWeekDeath(cdcList_20210911, diffTrend=TRUE)
plotAgeWeekDeath(cdcList_20210911, keyStates=c("NY", "NJ", "MA", "CT"), diffTrend=TRUE)
plotAgeWeekDeath(cdcList_20210911, keyStates=c("TX", "LA", "MS", "AL", "FL"), diffTrend=TRUE)
plotAgeWeekDeath(cdcList_20210911, keyStates=c("ND", "SD", "MT", "IA", "MO"), diffTrend=TRUE)
# Examples with return options
plotAgeWeekDeath(cdcList_20210911, diffTrend=TRUE, printPlots=TRUE, returnData=FALSE, returnPlots=FALSE)
plotAgeWeekDeath(cdcList_20210911, diffTrend=TRUE, printPlots=FALSE, returnData=FALSE, returnPlots=FALSE)
plotAgeWeekDeath(cdcList_20210911, diffTrend=TRUE, printPlots=FALSE, returnData=FALSE, returnPlots=TRUE)
## $p1
##
## $p2
##
## $p1Data
## NULL
##
## $p2Full
## NULL
##
## $p2Sum
## NULL
plotAgeWeekDeath(cdcList_20210911, diffTrend=TRUE, printPlots=FALSE, returnData=TRUE, returnPlots=FALSE)
## $p1
## NULL
##
## $p2
## NULL
##
## $p1Data
## # A tibble: 2,070 x 4
## weekEnding year age deaths
## <date> <fct> <fct> <dbl>
## 1 2015-01-10 2015 Under 25 years 1068
## 2 2015-01-10 2015 25-44 years 2344
## 3 2015-01-10 2015 45-64 years 11354
## 4 2015-01-10 2015 65-74 years 10711
## 5 2015-01-10 2015 75-84 years 14795
## 6 2015-01-10 2015 85 years and older 21412
## 7 2015-01-17 2015 Under 25 years 1102
## 8 2015-01-17 2015 25-44 years 2288
## 9 2015-01-17 2015 45-64 years 11248
## 10 2015-01-17 2015 65-74 years 10552
## # ... with 2,060 more rows
##
## $p2Full
## # A tibble: 2,070 x 5
## weekEnding year age deaths pred
## <date> <fct> <fct> <dbl> <dbl>
## 1 2015-01-10 2015 Under 25 years 1068 1164.
## 2 2015-01-10 2015 25-44 years 2344 2361.
## 3 2015-01-10 2015 45-64 years 11354 10355.
## 4 2015-01-10 2015 65-74 years 10711 9465.
## 5 2015-01-10 2015 75-84 years 14795 12086.
## 6 2015-01-10 2015 85 years and older 21412 16594.
## 7 2015-01-17 2015 Under 25 years 1102 1164.
## 8 2015-01-17 2015 25-44 years 2288 2363.
## 9 2015-01-17 2015 45-64 years 11248 10354.
## 10 2015-01-17 2015 65-74 years 10552 9470.
## # ... with 2,060 more rows
##
## $p2Sum
## # A tibble: 42 x 4
## year age deaths pred
## <fct> <fct> <dbl> <dbl>
## 1 2015 Under 25 years 58450 60036.
## 2 2015 25-44 years 120850 124788.
## 3 2015 45-64 years 530877 538292.
## 4 2015 65-74 years 493068 498992.
## 5 2015 75-84 years 633929 634125.
## 6 2015 85 years and older 853993 863434.
## 7 2016 Under 25 years 60583 58992.
## 8 2016 25-44 years 131849 128886.
## 9 2016 45-64 years 538354 537997.
## 10 2016 65-74 years 509961 512843.
## # ... with 32 more rows
The function can then be used to gather data for each state:
# Create death statistics for each state
tempStateList <- sort(c(state.abb, "DC")) %>%
setdiff(c("WY", "WV", "VT", "SD", "RI", "NH", "ND", "MT", "ME", "ID", "HI", "DE", "DC", "AK"))
tempState <- lapply(tempStateList,
FUN=function(x) plotAgeWeekDeath(cdcList_20210911,
keyStates=x,
diffTrend=TRUE,
printPlots=FALSE,
returnData=TRUE,
returnPlots=FALSE
)[["p2Full"]]
) %>%
bind_rows(.id="stateNum") %>%
mutate(state=tempStateList[as.integer(stateNum)])
tempState
## # A tibble: 74,747 x 7
## stateNum weekEnding year age deaths pred state
## <chr> <date> <fct> <fct> <dbl> <dbl> <chr>
## 1 1 2015-01-10 2015 Under 25 years 25 26.0 AL
## 2 1 2015-01-10 2015 25-44 years 67 52.8 AL
## 3 1 2015-01-10 2015 45-64 years 253 218. AL
## 4 1 2015-01-10 2015 65-74 years 202 190. AL
## 5 1 2015-01-10 2015 75-84 years 272 231. AL
## 6 1 2015-01-10 2015 85 years and older 320 241. AL
## 7 1 2015-01-17 2015 Under 25 years 28 26.0 AL
## 8 1 2015-01-17 2015 25-44 years 49 52.8 AL
## 9 1 2015-01-17 2015 45-64 years 256 218. AL
## 10 1 2015-01-17 2015 65-74 years 222 190. AL
## # ... with 74,737 more rows
# Plot the results
tempState %>%
filter() %>%
group_by(state, weekEnding, year) %>%
summarize(across(where(is.numeric), sum), .groups="drop") %>%
ggplot(aes(x=weekEnding)) +
geom_line(aes(y=deaths, color=year)) +
geom_line(aes(y=pred), lty=2) +
facet_wrap(~state, scales="free_y") +
labs(x=NULL,
y="Weekly all-cause deaths",
title="All-cause deaths per week by state",
subtitle="Dashed line is linear trend without seasonality using 2015-2019 data"
) +
lims(y=c(0, NA)) +
scale_color_discrete("")
The process is updated so that data can be calculated for those groups of state-age where there is sufficient data:
# Grid of all valid combinations of weekEnding-state-age
validGrid <- expand.grid(weekEnding=sort(unique(cdcList_20210911$cdc$weekEnding)),
state=c(state.abb, "DC"),
age=unique(cdcList_20210911$cdc$age),
stringsAsFactors=FALSE,
KEEP.OUT.ATTRS=FALSE
) %>%
tibble::as_tibble() %>%
mutate(year=factor(lubridate::epiyear(weekEnding)))
# Check that there are no records missing from validGrid
cdcList_20210911$cdc %>%
anti_join(validGrid, by=c("weekEnding", "state", "age", "year"))
## # A tibble: 0 x 12
## # ... with 12 variables: fullState <chr>, weekEnding <date>, state <chr>,
## # year <fct>, week <int>, age <fct>, period <fct>, Type <chr>,
## # Suppress <chr>, n <int>, deaths <dbl>, Note <chr>
# Modified data to insert 0 deaths for any missing data
cdcMod <- cdcList_20210911$cdc %>%
select(state, age, weekEnding, year, deaths) %>%
right_join(validGrid, by=c("weekEnding", "state", "age", "year")) %>%
mutate(deaths=ifelse(is.na(deaths), 0, deaths))
cdcMod
## # A tibble: 105,570 x 5
## state age weekEnding year deaths
## <chr> <fct> <date> <fct> <dbl>
## 1 AL Under 25 years 2015-01-10 2015 25
## 2 AL 25-44 years 2015-01-10 2015 67
## 3 AL 45-64 years 2015-01-10 2015 253
## 4 AL 65-74 years 2015-01-10 2015 202
## 5 AL 75-84 years 2015-01-10 2015 272
## 6 AL 85 years and older 2015-01-10 2015 320
## 7 AL Under 25 years 2015-01-17 2015 28
## 8 AL 25-44 years 2015-01-17 2015 49
## 9 AL 45-64 years 2015-01-17 2015 256
## 10 AL 65-74 years 2015-01-17 2015 222
## # ... with 105,560 more rows
# Create data for each state as before
modStateList <- sort(c(state.abb, "DC"))
modState <- lapply(modStateList,
FUN=function(x) plotAgeWeekDeath(list("cdc"=cdcMod),
keyStates=x,
diffTrend=TRUE,
printPlots=FALSE,
returnData=TRUE,
returnPlots=FALSE
)[["p2Full"]]
) %>%
bind_rows(.id="stateNum") %>%
mutate(state=modStateList[as.integer(stateNum)])
modState
## # A tibble: 105,570 x 7
## stateNum weekEnding year age deaths pred state
## <chr> <date> <fct> <fct> <dbl> <dbl> <chr>
## 1 1 2015-01-10 2015 Under 25 years 0 0.0340 AK
## 2 1 2015-01-10 2015 25-44 years 0 2.13 AK
## 3 1 2015-01-10 2015 45-64 years 22 22.9 AK
## 4 1 2015-01-10 2015 65-74 years 12 14.9 AK
## 5 1 2015-01-10 2015 75-84 years 13 13.8 AK
## 6 1 2015-01-10 2015 85 years and older 15 13.1 AK
## 7 1 2015-01-17 2015 Under 25 years 0 0.0344 AK
## 8 1 2015-01-17 2015 25-44 years 0 2.14 AK
## 9 1 2015-01-17 2015 45-64 years 22 22.9 AK
## 10 1 2015-01-17 2015 65-74 years 19 15.0 AK
## # ... with 105,560 more rows
# Plot the results
modState %>%
filter() %>%
group_by(state, weekEnding, year) %>%
summarize(across(where(is.numeric), sum), .groups="drop") %>%
ggplot(aes(x=weekEnding)) +
geom_line(aes(y=deaths, color=year)) +
geom_line(aes(y=pred), lty=2) +
facet_wrap(~state, scales="free_y") +
labs(x=NULL,
y="Weekly all-cause deaths",
title="All-cause deaths per week by state",
subtitle="Dashed line is linear trend without seasonality using 2015-2019 data"
) +
lims(y=c(0, NA)) +
scale_color_discrete("")
Plots of proportion of deaths by year are created:
modState %>%
group_by(state, year, age) %>%
summarize(across(where(is.numeric), sum), .groups="drop") %>%
ggplot(aes(x=year, y=deaths)) +
geom_col(aes(fill=age), position="fill") +
facet_wrap(~state) +
labs(x=NULL, y="Proportion of deaths", title="Proportion of deaths by age group and state") +
scale_fill_discrete("") +
theme(axis.text.x = element_text(angle = 90))
modState %>%
group_by(state, year, age) %>%
summarize(across(where(is.numeric), sum), .groups="drop") %>%
filter(year=="2020") %>%
ggplot(aes(x=fct_reorder(state, deaths, sum))) +
geom_col(aes(y=deaths/1000, fill=age), position="stack") +
geom_point(data=filter(readFromRDS("cdc_daily_210902")$dfPerCapita, date=="2020-12-13"),
aes(x=state, y=tot_deaths/1000)
) +
labs(x=NULL,
y="2020 deaths (000)",
title="2020 deaths by age group and state",
subtitle="2020 reported coronavirus deaths are points, all-cause deaths are filled bars"
) +
scale_fill_discrete("") +
coord_flip()
Shares of total deaths by coronavirus are plotted on top of the age distribution:
p12020 <- modState %>%
group_by(state, year, age) %>%
summarize(across(where(is.numeric), sum), .groups="drop") %>%
filter(year=="2020") %>%
inner_join(select(filter(readFromRDS("cdc_daily_210902")$dfPerCapita, date=="2020-12-31"), state, tot_deaths),
by="state"
) %>%
ggplot(aes(x=fct_reorder2(state, .x=tot_deaths, .y=deaths, .fun=function(x, y) -mean(x)/sum(y)))) +
geom_col(aes(y=deaths/1000, fill=age), position="fill") +
geom_point(data=~summarize(group_by(., state), tot_deaths=mean(tot_deaths), deaths=sum(deaths),
.groups="drop"
),
aes(x=state, y=tot_deaths/deaths)
) +
labs(x=NULL,
y="Proportion of 2020 deaths (000)",
title="2020 deaths by age group and state",
subtitle="2020 reported coronavirus deaths are points, all-cause 2020 deaths are filled bars"
) +
scale_fill_discrete("") +
coord_flip()
p12020
The plots are repeated for YTD July 2021:
p22021 <- modState %>%
filter(weekEnding <= as.Date("2021-07-31")) %>%
group_by(state, year, age) %>%
summarize(across(where(is.numeric), sum), .groups="drop") %>%
filter(year=="2021") %>%
inner_join(readFromRDS("cdc_daily_210902")$dfPerCapita %>%
filter(date %in% as.Date(c("2020-12-31", "2021-07-31"))) %>%
group_by(state) %>%
summarize(tot_deaths=max(tot_deaths)-min(tot_deaths)),
by="state"
) %>%
ggplot(aes(x=fct_reorder2(state, .x=tot_deaths, .y=deaths, .fun=function(x, y) -mean(x)/sum(y)))) +
geom_col(aes(y=deaths/1000, fill=age), position="fill") +
geom_point(data=~summarize(group_by(., state), tot_deaths=mean(tot_deaths), deaths=sum(deaths),
.groups="drop"
),
aes(x=state, y=tot_deaths/deaths)
) +
labs(x=NULL,
y="Proportion of YTD July 2021 deaths (000)",
title="YTD July 2021 deaths by age group and state",
subtitle="2021 reported coronavirus deaths are points, all-cause 2021 deaths are filled bars"
) +
scale_fill_discrete("") +
coord_flip()
p22021
gridExtra::grid.arrange(p12020 + theme(legend.position="bottom") + geom_hline(yintercept=0.125, lty=2),
p22021 + theme(legend.position="bottom") + geom_hline(yintercept=0.125, lty=2),
nrow=1
)
The latest CDC all-cause death data are downloaded and processed:
cdcLoc <- "Weekly_counts_of_deaths_by_jurisdiction_and_age_group_downloaded_20211105.csv"
cdcList_20211105 <- readRunCDCAllCause(loc=cdcLoc,
weekThru=40,
lst=readFromRDS("cdc_daily_211104"),
stateNoCheck=c("NC"),
pdfCluster=TRUE,
pdfAge=TRUE
)
##
## Parameter cvDeathThru has been set as: 2021-10-09
##
##
## *** Data suppression checks ***
##
## Rows in states to be checked that have NA deaths or a note for suppression:
## [1] state weekEnding year week age Suppress deaths
## <0 rows> (or 0-length row.names)
##
##
## Problems by state:
## # A tibble: 1 x 5
## noCheck state problem n deaths
## <lgl> <chr> <lgl> <int> <dbl>
## 1 TRUE NC TRUE 6 NA
## Warning in max(.): no non-missing arguments to max; returning -Inf
##
##
## There are 0 rows with errors; maximum for any given state is -Inf errors
##
##
## Data suppression checks passed
##
##
## *** File has been checked for uniqueness by: state year week age
##
## Rows: 98,081
## Columns: 12
## $ fullState <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Alabama", "Ala~
## $ weekEnding <date> 2015-01-10, 2015-01-10, 2015-01-10, 2015-01-10, 2015-01-10~
## $ state <chr> "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL",~
## $ year <fct> 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015,~
## $ week <int> 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4,~
## $ age <fct> Under 25 years, 25-44 years, 45-64 years, 65-74 years, 75-8~
## $ period <fct> 2015-2019, 2015-2019, 2015-2019, 2015-2019, 2015-2019, 2015~
## $ Type <chr> "Predicted (weighted)", "Predicted (weighted)", "Predicted ~
## $ Suppress <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,~
## $ n <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,~
## $ deaths <dbl> 25, 67, 253, 202, 272, 320, 28, 49, 256, 222, 253, 332, 26,~
## $ Note <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,~
##
## Check Control Levels and Record Counts for Processed Data:
##
##
## Checking variable combination: age
## # A tibble: 6 x 4
## age n n_deaths_na deaths
## <fct> <dbl> <dbl> <dbl>
## 1 Under 25 years 11493 0 397614
## 2 25-44 years 14710 0 996414
## 3 45-64 years 17974 0 3854160
## 4 65-74 years 17966 0 3863612
## 5 75-84 years 17974 0 4743692
## 6 85 years and older 17964 0 6066231
##
##
## Checking variable combination: period year Type
## # A tibble: 7 x 6
## period year Type n n_deaths_na deaths
## <fct> <fct> <chr> <dbl> <dbl> <dbl>
## 1 2015-2019 2015 Predicted (weighted) 14363 0 2691168
## 2 2015-2019 2016 Predicted (weighted) 14443 0 2723213
## 3 2015-2019 2017 Predicted (weighted) 14404 0 2801988
## 4 2015-2019 2018 Predicted (weighted) 14398 0 2830357
## 5 2015-2019 2019 Predicted (weighted) 14415 0 2844026
## 6 2020 2020 Predicted (weighted) 14822 0 3432831
## 7 2021 2021 Predicted (weighted) 11236 0 2598140
##
##
## Checking variable combination: period Suppress
## # A tibble: 3 x 5
## period Suppress n n_deaths_na deaths
## <fct> <chr> <dbl> <dbl> <dbl>
## 1 2015-2019 <NA> 72023 0 13890752
## 2 2020 <NA> 14822 0 3432831
## 3 2021 <NA> 11236 0 2598140
##
##
## Checking variable combination: period Note
## # A tibble: 10 x 5
## period Note n n_deaths_na deaths
## <fct> <chr> <dbl> <dbl> <dbl>
## 1 2015-20~ <NA> 72023 0 1.39e7
## 2 2020 Data in recent weeks are incomplete. Only~ 10985 0 2.09e6
## 3 2020 Data in recent weeks are incomplete. Only~ 4 0 1.17e2
## 4 2020 Data in recent weeks are incomplete. Only~ 2723 0 1.10e6
## 5 2020 Weighted numbers of deaths are 20% or mor~ 279 0 6.00e4
## 6 2020 Weights may be too low to account for und~ 104 0 4.10e4
## 7 2020 <NA> 727 0 1.38e5
## 8 2021 Data in recent weeks are incomplete. Only~ 10080 0 2.20e6
## 9 2021 Data in recent weeks are incomplete. Only~ 68 0 1.17e4
## 10 2021 Data in recent weeks are incomplete. Only~ 1088 0 3.91e5
##
## *** File has been checked for uniqueness by: cluster year week
##
## Plots will be run after excluding stateNoCheck states
##
## Detailed cluster summary PDF file is available at: ./RInputFiles/Coronavirus/Plots/CDC_cluster_2021w40.pdf
##
## Returning plot outputs to the main log file
## Joining, by = "state"
##
## Detailed age summary PDF file is available at: ./RInputFiles/Coronavirus/Plots/CDC_age_2021w40.pdf
##
## Returning plot outputs to the main log file
All-cause deaths by year and state are plotted, with comparison to the 2015-2019 trendline:
# Grid of all valid combinations of weekEnding-state-age
validGrid <- expand.grid(weekEnding=sort(unique(cdcList_20211105$cdc$weekEnding)),
state=c(state.abb, "DC"),
age=unique(cdcList_20211105$cdc$age),
stringsAsFactors=FALSE,
KEEP.OUT.ATTRS=FALSE
) %>%
tibble::as_tibble() %>%
mutate(year=factor(lubridate::epiyear(weekEnding)))
# Check that there are no records missing from validGrid
cdcList_20211105$cdc %>%
anti_join(validGrid, by=c("weekEnding", "state", "age", "year"))
## # A tibble: 0 x 12
## # ... with 12 variables: fullState <chr>, weekEnding <date>, state <chr>,
## # year <fct>, week <int>, age <fct>, period <fct>, Type <chr>,
## # Suppress <chr>, n <int>, deaths <dbl>, Note <chr>
# Modified data to insert 0 deaths for any missing data
cdcMod <- cdcList_20211105$cdc %>%
select(state, age, weekEnding, year, deaths) %>%
right_join(validGrid, by=c("weekEnding", "state", "age", "year")) %>%
mutate(deaths=ifelse(is.na(deaths), 0, deaths))
cdcMod
## # A tibble: 108,018 x 5
## state age weekEnding year deaths
## <chr> <fct> <date> <fct> <dbl>
## 1 AL Under 25 years 2015-01-10 2015 25
## 2 AL 25-44 years 2015-01-10 2015 67
## 3 AL 45-64 years 2015-01-10 2015 253
## 4 AL 65-74 years 2015-01-10 2015 202
## 5 AL 75-84 years 2015-01-10 2015 272
## 6 AL 85 years and older 2015-01-10 2015 320
## 7 AL Under 25 years 2015-01-17 2015 28
## 8 AL 25-44 years 2015-01-17 2015 49
## 9 AL 45-64 years 2015-01-17 2015 256
## 10 AL 65-74 years 2015-01-17 2015 222
## # ... with 108,008 more rows
# Create data for each state as before
modStateList <- sort(c(state.abb, "DC"))
modState <- lapply(modStateList,
FUN=function(x) plotAgeWeekDeath(list("cdc"=cdcMod),
keyStates=x,
diffTrend=TRUE,
printPlots=FALSE,
returnData=TRUE,
returnPlots=FALSE
)[["p2Full"]]
) %>%
bind_rows(.id="stateNum") %>%
mutate(state=modStateList[as.integer(stateNum)])
modState
## # A tibble: 108,018 x 7
## stateNum weekEnding year age deaths pred state
## <chr> <date> <fct> <fct> <dbl> <dbl> <chr>
## 1 1 2015-01-10 2015 Under 25 years 0 0.0340 AK
## 2 1 2015-01-10 2015 25-44 years 0 2.13 AK
## 3 1 2015-01-10 2015 45-64 years 22 22.9 AK
## 4 1 2015-01-10 2015 65-74 years 12 14.9 AK
## 5 1 2015-01-10 2015 75-84 years 13 13.8 AK
## 6 1 2015-01-10 2015 85 years and older 15 13.1 AK
## 7 1 2015-01-17 2015 Under 25 years 0 0.0344 AK
## 8 1 2015-01-17 2015 25-44 years 0 2.14 AK
## 9 1 2015-01-17 2015 45-64 years 22 22.9 AK
## 10 1 2015-01-17 2015 65-74 years 19 15.0 AK
## # ... with 108,008 more rows
# Plot the results
modState %>%
filter() %>%
group_by(state, weekEnding, year) %>%
summarize(across(where(is.numeric), sum), .groups="drop") %>%
ggplot(aes(x=weekEnding)) +
geom_line(aes(y=deaths, color=year)) +
geom_line(aes(y=pred), lty=2) +
facet_wrap(~state, scales="free_y") +
labs(x=NULL,
y="Weekly all-cause deaths",
title="All-cause deaths per week by state",
subtitle="Dashed line is linear trend without seasonality using 2015-2019 data"
) +
lims(y=c(0, NA)) +
scale_color_discrete("")
All-cause deaths by age are also plotted:
plotAgeWeekDeath(cdcList_20211105, diffTrend=TRUE, printPlots=TRUE, returnData=FALSE, returnPlots=FALSE)
Deaths by location data are also updated and explored:
deathAgeLoc <- "./RInputFiles/Coronavirus/COvID_deaths_age_place_20211108.csv"
if (!file.exists(deathAgeLoc)) {
fileDownload(fileName=deathAgeLoc,
url="https://data.cdc.gov/api/views/4va6-ph5s/rows.csv?accessType=DOWNLOAD"
)
} else {
cat("\nFile already exists, not downloading\n")
}
##
## File already exists, not downloading
deathAge_20211108_raw <- fileRead(deathAgeLoc, col_types="cccciiccccddddddc")
glimpse(deathAge_20211108_raw)
## Rows: 109,350
## Columns: 17
## $ `Data as of` <chr> "11/03/2021", "11/03/2021",~
## $ `Start Date` <chr> "01/01/2020", "01/01/2020",~
## $ `End Date` <chr> "10/30/2021", "10/30/2021",~
## $ Group <chr> "By Total", "By Total", "By~
## $ Year <int> NA, NA, NA, NA, NA, NA, NA,~
## $ Month <int> NA, NA, NA, NA, NA, NA, NA,~
## $ `HHS Region` <chr> "0", "0", "0", "0", "0", "0~
## $ State <chr> "United States", "United St~
## $ `Place of Death` <chr> "Total - All Places of Deat~
## $ `Age group` <chr> "All Ages", "0-17 years", "~
## $ `COVID-19 Deaths` <dbl> 748163, 576, 4288, 12614, 3~
## $ `Total Deaths` <dbl> 6086297, 60810, 115989, 168~
## $ `Pneumonia Deaths` <dbl> 670896, 1069, 3912, 10531, ~
## $ `Pneumonia and COVID-19 Deaths` <dbl> 383847, 148, 2069, 6561, 16~
## $ `Influenza Deaths` <dbl> 9406, 189, 149, 325, 511, 2~
## $ `Pneumonia, Influenza, or COVID-19 Deaths` <dbl> 1043261, 1686, 6268, 16884,~
## $ Footnote <chr> NA, NA, NA, NA, NA, NA, NA,~
deathAge_20211108_conv <- deathAge_20211108_raw %>%
colRenamer(vecRename=c("Data as of"="asofDate",
"Start Date"="startDate",
"End Date"="endDate",
"HHS Region"="HHSRegion",
"Place of Death"="deathPlace",
"Age group"="Age",
"COVID-19 Deaths"="covidDeaths",
"Total Deaths"="totalDeaths",
"Pneumonia Deaths"="pneumoDeaths",
"Pneumonia and COVID-19 Deaths"="pneumoCovidDeaths",
"Influenza Deaths"="fluDeaths",
"Pneumonia, Influenza, or COVID-19 Deaths"="pnemoFluCovidDeaths"
)
) %>%
colMutater(selfList=list("asofDate"=lubridate::mdy, "startDate"=lubridate::mdy, "endDate"=lubridate::mdy))
glimpse(deathAge_20211108_conv)
## Rows: 109,350
## Columns: 17
## $ asofDate <date> 2021-11-03, 2021-11-03, 2021-11-03, 2021-11-03, 2~
## $ startDate <date> 2020-01-01, 2020-01-01, 2020-01-01, 2020-01-01, 2~
## $ endDate <date> 2021-10-30, 2021-10-30, 2021-10-30, 2021-10-30, 2~
## $ Group <chr> "By Total", "By Total", "By Total", "By Total", "B~
## $ Year <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
## $ Month <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
## $ HHSRegion <chr> "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", ~
## $ State <chr> "United States", "United States", "United States",~
## $ deathPlace <chr> "Total - All Places of Death", "Total - All Places~
## $ Age <chr> "All Ages", "0-17 years", "18-29 years", "30-39 ye~
## $ covidDeaths <dbl> 748163, 576, 4288, 12614, 30945, 135375, 169188, 1~
## $ totalDeaths <dbl> 6086297, 60810, 115989, 168821, 252614, 1028753, 1~
## $ pneumoDeaths <dbl> 670896, 1069, 3912, 10531, 24806, 120871, 157928, ~
## $ pneumoCovidDeaths <dbl> 383847, 148, 2069, 6561, 16650, 75645, 94263, 1021~
## $ fluDeaths <dbl> 9406, 189, 149, 325, 511, 2239, 2045, 2038, 1910, ~
## $ pnemoFluCovidDeaths <dbl> 1043261, 1686, 6268, 16884, 39538, 182471, 234556,~
## $ Footnote <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
# Combinations of startDate and endDate
deathAge_20211108_conv %>%
count(asofDate, startDate, endDate) %>%
ggplot(aes(y=startDate, x=endDate)) +
geom_point(aes(size=n)) +
facet_wrap(~asofDate) +
labs(x="Ending Date", y="Starting Date", title="Combinations of Start and End Date")
deathAge_20211108_conv %>%
count(Group, deathPlace, Age) %>%
ggplot(aes(x=Group, y=deathPlace)) +
geom_tile(aes(fill=n)) +
facet_wrap(~Age) +
labs(x="Group", y="Place of Death", title="Combinations of Age, Place of Death, and Group")
deathState <- deathAge_20211108_conv %>%
filter(Group=="By Total", deathPlace=="Total - All Places of Death", Age=="All Ages") %>%
group_by(State) %>%
summarize(across(where(is.numeric), sum, na.rm=TRUE)) %>%
mutate(abb=state.abb[match(State, state.name)])
deathState %>% filter(is.na(abb))
## # A tibble: 4 x 10
## State Year Month covidDeaths totalDeaths pneumoDeaths pneumoCovidDeat~
## <chr> <int> <int> <dbl> <dbl> <dbl> <dbl>
## 1 District of~ 0 0 1597 12879 1977 1300
## 2 New York Ci~ 0 0 30430 134170 18820 11433
## 3 Puerto Rico 0 0 3160 58488 8147 2295
## 4 United Stat~ 0 0 748163 6086297 670896 383847
## # ... with 3 more variables: fluDeaths <dbl>, pnemoFluCovidDeaths <dbl>,
## # abb <chr>
deathBase <- deathState %>%
select(State, covidDeaths, totalDeaths) %>%
mutate(noncovid=covidDeaths/totalDeaths) %>%
filter(!(State %in% c("United States", "Puerto Rico"))) %>%
pivot_longer(-c(State)) %>%
ggplot(aes(x=fct_reorder(State, value, max), y=value/1000)) +
coord_flip() +
theme(legend.position="bottom")
deathBase +
geom_col(data=~filter(., name=="totalDeaths"), aes(fill="All")) +
geom_col(data=~filter(., name=="covidDeaths"), aes(fill="COVID")) +
scale_fill_manual("Type", breaks=c("COVID", "All"), labels=c("COVID", "All"), values=c("red", "black")) +
labs(title="Deaths 2020-present by state", x=NULL, y="Deaths (000s)")
deathBase +
geom_col(data=~filter(., name=="noncovid"), aes(y=value), position="identity") +
labs(x=NULL, y=NULL, title="Proportion of deaths from COVID")
# Add the state abbreviation
deathAge_20211108_conv <- deathAge_20211108_conv %>%
mutate(abb=c(state.abb, "DC", "US")[match(State, c(state.name, "District of Columbia", "United States"))])
# Get a list of the possible variables
allCheckVars <- names(deathAge_20211108_conv) %>%
setdiff(deathAge_20211108_conv %>% head(1) %>% select_if(is.numeric) %>% names()) %>%
setdiff(c("Footnote", "abb", "HHSRegion"))
# Test for each variable in allCheckVars
subMap <- c("State"="United States", "Age"="All Ages", "deathPlace"="Total - All Places of Death")
lapply(c("State", "deathPlace", "Age"),
FUN=function(x) deathAge_20211108_conv %>%
select(-Year, -Month) %>%
checkSubTotals(checkByVars=allCheckVars %>% setdiff(x), subVar=x, subVarTotal=unname(subMap[x])) %>%
checkNumbers(byVars=allCheckVars, keyVar=x)
)
##
## Sub-lists are identical by: asofDate, startDate, endDate, Group, State, deathPlace, Age
## # A tibble: 1,240 x 12
## asofDate startDate endDate Group State deathPlace Age name dfSub
## <date> <date> <date> <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 2021-11-03 2020-10-01 2020-10-31 By Mo~ Unite~ Total - Al~ 30-3~ pnemo~ 205
## 2 2021-11-03 2020-02-01 2020-02-29 By Mo~ Unite~ Total - Al~ 30-3~ pnemo~ 71
## 3 2021-11-03 2020-11-01 2020-11-30 By Mo~ Unite~ Total - Al~ 30-3~ pneum~ 227
## 4 2021-11-03 2020-08-01 2020-08-31 By Mo~ Unite~ Other 0-17~ total~ 115
## 5 2021-11-03 2020-09-01 2020-09-30 By Mo~ Unite~ Decedent's~ 50-6~ pnemo~ 189
## 6 2021-11-03 2020-01-01 2020-01-31 By Mo~ Unite~ Total - Al~ 30-3~ pnemo~ 183
## 7 2021-11-03 2020-02-01 2020-02-29 By Mo~ Unite~ Healthcare~ 65-7~ fluDe~ 204
## 8 2021-11-03 2020-02-01 2020-02-29 By Mo~ Unite~ Total - Al~ 65-7~ fluDe~ 317
## 9 2021-11-03 2020-01-01 2020-01-31 By Mo~ Unite~ Total - Al~ 30-3~ pneum~ 73
## 10 2021-11-03 2020-01-01 2021-10-30 By To~ Unite~ Other 40-4~ pneum~ 260
## # ... with 1,230 more rows, and 3 more variables: dfTot <dbl>, delta <dbl>,
## # pct <dbl>
##
## Sub-lists are identical by: asofDate, startDate, endDate, Group, State, deathPlace, Age
## # A tibble: 0 x 12
## # ... with 12 variables: asofDate <date>, startDate <date>, endDate <date>,
## # Group <chr>, State <chr>, deathPlace <chr>, Age <chr>, name <chr>,
## # dfSub <dbl>, dfTot <dbl>, delta <dbl>, pct <dbl>
##
## Sub-lists are identical by: asofDate, startDate, endDate, Group, State, deathPlace, Age
## # A tibble: 0 x 12
## # ... with 12 variables: asofDate <date>, startDate <date>, endDate <date>,
## # Group <chr>, State <chr>, deathPlace <chr>, Age <chr>, name <chr>,
## # dfSub <dbl>, dfTot <dbl>, delta <dbl>, pct <dbl>
## [[1]]
## # A tibble: 1,240 x 12
## asofDate startDate endDate Group State deathPlace Age name dfSub
## <date> <date> <date> <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 2021-11-03 2020-10-01 2020-10-31 By Mo~ Unite~ Total - Al~ 30-3~ pnemo~ 205
## 2 2021-11-03 2020-02-01 2020-02-29 By Mo~ Unite~ Total - Al~ 30-3~ pnemo~ 71
## 3 2021-11-03 2020-11-01 2020-11-30 By Mo~ Unite~ Total - Al~ 30-3~ pneum~ 227
## 4 2021-11-03 2020-08-01 2020-08-31 By Mo~ Unite~ Other 0-17~ total~ 115
## 5 2021-11-03 2020-09-01 2020-09-30 By Mo~ Unite~ Decedent's~ 50-6~ pnemo~ 189
## 6 2021-11-03 2020-01-01 2020-01-31 By Mo~ Unite~ Total - Al~ 30-3~ pnemo~ 183
## 7 2021-11-03 2020-02-01 2020-02-29 By Mo~ Unite~ Healthcare~ 65-7~ fluDe~ 204
## 8 2021-11-03 2020-02-01 2020-02-29 By Mo~ Unite~ Total - Al~ 65-7~ fluDe~ 317
## 9 2021-11-03 2020-01-01 2020-01-31 By Mo~ Unite~ Total - Al~ 30-3~ pneum~ 73
## 10 2021-11-03 2020-01-01 2021-10-30 By To~ Unite~ Other 40-4~ pneum~ 260
## # ... with 1,230 more rows, and 3 more variables: dfTot <dbl>, delta <dbl>,
## # pct <dbl>
##
## [[2]]
## # A tibble: 0 x 12
## # ... with 12 variables: asofDate <date>, startDate <date>, endDate <date>,
## # Group <chr>, State <chr>, deathPlace <chr>, Age <chr>, name <chr>,
## # dfSub <dbl>, dfTot <dbl>, delta <dbl>, pct <dbl>
##
## [[3]]
## # A tibble: 0 x 12
## # ... with 12 variables: asofDate <date>, startDate <date>, endDate <date>,
## # Group <chr>, State <chr>, deathPlace <chr>, Age <chr>, name <chr>,
## # dfSub <dbl>, dfTot <dbl>, delta <dbl>, pct <dbl>
The above are converted to functional form:
processAllCauseLocation <- function(loc,
url="https://data.cdc.gov/api/views/4va6-ph5s/rows.csv?accessType=DOWNLOAD",
col_types="cccciiccccddddddc",
vecRename=c("Data as of"="asofDate",
"Start Date"="startDate",
"End Date"="endDate",
"HHS Region"="HHSRegion",
"Place of Death"="deathPlace",
"Age group"="Age",
"COVID-19 Deaths"="covidDeaths",
"Total Deaths"="totalDeaths",
"Pneumonia Deaths"="pneumoDeaths",
"Pneumonia and COVID-19 Deaths"="pneumoCovidDeaths",
"Influenza Deaths"="fluDeaths",
"Pneumonia, Influenza, or COVID-19 Deaths"="pnemoFluCovidDeaths"
),
selfList=list("asofDate"=lubridate::mdy,
"startDate"=lubridate::mdy,
"endDate"=lubridate::mdy
),
dir="./RInputFiles/Coronavirus/",
dlData=isFALSE(file.exists(paste0(dir, loc))),
allCheckVars=NULL,
subMap=c("State"="United States",
"Age"="All Ages",
"deathPlace"="Total - All Places of Death"
),
createPlot5=TRUE
) {
# FUNCTION ARGUMENTS:
# loc: the location of the CDC all-cause death by location file
# url: the location of the all-cause death by location data
# col_types: the column types for the data in loc
# vecRename: vector for renaming columns in raw data
# selfList: list for colMutater() for the data from loc
# dir: the directory for the downloaded data in loc
# dlData: boolean, should data be downloaded?
# allCheckVars: variable list to be checked
# subMap: subsets to be checked for summation to the whole
# createPlot5: boolean, should comparisons of totals and sum of subtotals be created
# Step 0: Download the data if requested
if (dlData) fileDownload(fileName=paste0(dir, loc), url=url)
# Step 1: Read the CSV data
deathLoc_raw <- fileRead(paste0(dir, loc), col_types=col_types)
# Step 2: Rename variables for easier interpretation, convert the dates, add the state abbreviation
deathLoc_conv <- deathLoc_raw %>%
colRenamer(vecRename=vecRename) %>%
colMutater(selfList=selfList) %>%
mutate(abb=c(state.abb, "DC", "US")[match(State, c(state.name, "District of Columbia", "United States"))])
# Step 3: Plots for combinations included
p1 <- deathLoc_conv %>%
count(asofDate, startDate, endDate) %>%
ggplot(aes(y=startDate, x=endDate)) +
geom_point(aes(size=n)) +
facet_wrap(~asofDate) +
labs(x="Ending Date", y="Starting Date", title="Combinations of Start and End Date")
p2 <- deathLoc_conv %>%
count(Group, deathPlace, Age) %>%
ggplot(aes(x=Group, y=deathPlace)) +
geom_tile(aes(fill=n)) +
facet_wrap(~Age) +
labs(x="Group", y="Place of Death", title="Combinations of Age, Place of Death, and Group")
gridExtra::grid.arrange(p1, p2, ncol=1)
# Step 4: Deaths by state
dfTemp <- deathLoc_conv %>%
filter(Group=="By Total", deathPlace=="Total - All Places of Death", Age=="All Ages") %>%
group_by(State, abb) %>%
summarize(across(where(is.numeric), sum, na.rm=TRUE))
cat("\nStates without abbreviations\n")
print(dfTemp %>% filter(is.na(abb)))
# Step 4a: Plots
deathBase <- dfTemp %>%
select(State, covidDeaths, totalDeaths) %>%
mutate(noncovid=covidDeaths/totalDeaths) %>%
filter(!(State %in% c("United States", "Puerto Rico"))) %>%
pivot_longer(-c(State)) %>%
ggplot(aes(x=fct_reorder(State, value, max), y=value/1000)) +
coord_flip() +
theme(legend.position="bottom")
p3 <- deathBase +
geom_col(data=~filter(., name=="totalDeaths"), aes(fill="All")) +
geom_col(data=~filter(., name=="covidDeaths"), aes(fill="COVID")) +
scale_fill_manual("Type", breaks=c("COVID", "All"), labels=c("COVID", "All"), values=c("red", "black")) +
labs(title="Deaths 2020-present by state", x=NULL, y="Deaths (000s)")
p4 <- deathBase +
geom_col(data=~filter(., name=="noncovid"), aes(y=value), position="identity") +
labs(x=NULL, y=NULL, title="Proportion of deaths from COVID")
gridExtra::grid.arrange(p3, p4, nrow=1)
# Step 5: Check for alignment by variable combinations (run only if createPlot5 is TRUE)
if(isTRUE(createPlot5)) {
# Create allCheckVars if not passed
if (is.null(allCheckVars)) {
allCheckVars <- deathLoc_conv %>%
select(-where(is.numeric)) %>%
names %>%
setdiff(c("Footnote", "abb", "HHSRegion"))
}
# Run for the key subsets
lapply(names(subMap),
FUN=function(x) deathLoc_conv %>%
select(-Year, -Month) %>%
checkSubTotals(checkByVars=allCheckVars %>% setdiff(x),
subVar=x,
subVarTotal=unname(subMap[x])
) %>%
checkNumbers(byVars=allCheckVars, keyVar=x)
)
}
# Return a list of the key datasets
list(deathLoc_raw=deathLoc_raw, deathLoc_conv=deathLoc_conv)
}
# Test the function
testList_20211108 <- processAllCauseLocation("COvID_deaths_age_place_20211108.csv")
## `summarise()` has grouped output by 'State'. You can override using the `.groups` argument.
##
## States without abbreviations
## # A tibble: 2 x 10
## # Groups: State [2]
## State abb Year Month covidDeaths totalDeaths pneumoDeaths pneumoCovidDeat~
## <chr> <chr> <int> <int> <dbl> <dbl> <dbl> <dbl>
## 1 New Y~ <NA> 0 0 30430 134170 18820 11433
## 2 Puert~ <NA> 0 0 3160 58488 8147 2295
## # ... with 2 more variables: fluDeaths <dbl>, pnemoFluCovidDeaths <dbl>
##
## Sub-lists are identical by: asofDate, startDate, endDate, Group, State, deathPlace, Age
## # A tibble: 1,240 x 12
## asofDate startDate endDate Group State deathPlace Age name dfSub
## <date> <date> <date> <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 2021-11-03 2020-10-01 2020-10-31 By Mo~ Unite~ Total - Al~ 30-3~ pnemo~ 205
## 2 2021-11-03 2020-02-01 2020-02-29 By Mo~ Unite~ Total - Al~ 30-3~ pnemo~ 71
## 3 2021-11-03 2020-11-01 2020-11-30 By Mo~ Unite~ Total - Al~ 30-3~ pneum~ 227
## 4 2021-11-03 2020-08-01 2020-08-31 By Mo~ Unite~ Other 0-17~ total~ 115
## 5 2021-11-03 2020-09-01 2020-09-30 By Mo~ Unite~ Decedent's~ 50-6~ pnemo~ 189
## 6 2021-11-03 2020-01-01 2020-01-31 By Mo~ Unite~ Total - Al~ 30-3~ pnemo~ 183
## 7 2021-11-03 2020-02-01 2020-02-29 By Mo~ Unite~ Healthcare~ 65-7~ fluDe~ 204
## 8 2021-11-03 2020-02-01 2020-02-29 By Mo~ Unite~ Total - Al~ 65-7~ fluDe~ 317
## 9 2021-11-03 2020-01-01 2020-01-31 By Mo~ Unite~ Total - Al~ 30-3~ pneum~ 73
## 10 2021-11-03 2020-01-01 2021-10-30 By To~ Unite~ Other 40-4~ pneum~ 260
## # ... with 1,230 more rows, and 3 more variables: dfTot <dbl>, delta <dbl>,
## # pct <dbl>
##
## Sub-lists are identical by: asofDate, startDate, endDate, Group, State, deathPlace, Age
## # A tibble: 0 x 12
## # ... with 12 variables: asofDate <date>, startDate <date>, endDate <date>,
## # Group <chr>, State <chr>, deathPlace <chr>, Age <chr>, name <chr>,
## # dfSub <dbl>, dfTot <dbl>, delta <dbl>, pct <dbl>
##
## Sub-lists are identical by: asofDate, startDate, endDate, Group, State, deathPlace, Age
## # A tibble: 0 x 12
## # ... with 12 variables: asofDate <date>, startDate <date>, endDate <date>,
## # Group <chr>, State <chr>, deathPlace <chr>, Age <chr>, name <chr>,
## # dfSub <dbl>, dfTot <dbl>, delta <dbl>, pct <dbl>
identical(deathAge_20211108_raw, testList_20211108$deathLoc_raw)
## [1] TRUE
identical(deathAge_20211108_conv, testList_20211108$deathLoc_conv)
## [1] TRUE
Deaths by state are compared between CDC files, using October 31, 2021 as the cutoff:
# Create summary by state and year-month
death_sum_20211108 <- testList_20211108$deathLoc_conv %>%
filter(!is.na(Year), !is.na(Month), deathPlace=="Total - All Places of Death", Age=="All Ages") %>%
mutate(ym=lubridate::ym(paste0(Year, "-", zeroPad2(Month)))) %>%
select(State, abb, ym, where(is.numeric), -Year, -Month) %>%
pivot_longer(-c(State, abb, ym)) %>%
arrange(State, abb, name, ym) %>%
group_by(State, abb, name) %>%
mutate(cumValue=cumsum(ifelse(is.na(value), 0, value))) %>%
ungroup() %>%
mutate(date=lubridate::ceiling_date(ym, unit="month")-lubridate::days(1))
# Create summary from state-level file
death_daily_211104 <- readFromRDS("cdc_daily_211104")$dfPerCapita %>%
select(date, abb=state, tot_deaths) %>%
mutate(Year=lubridate::year(date), Month=lubridate::month(date)) %>%
group_by(Year, Month) %>%
filter(date==max(date)) %>%
ungroup()
# Create a plot for evolution of United States
death_sum_20211108 %>%
filter(abb=="US", name=="covidDeaths", ym <= "2021-10-31") %>%
ggplot(aes(x=date)) +
geom_line(aes(y=cumValue/1000, color="blue"), size=2) +
geom_point(data=summarize(group_by(filter(death_daily_211104, date <= "2021-10-31"), date),
tot_deaths=sum(tot_deaths, na.rm=TRUE)
),
aes(y=tot_deaths/1000, color="green"),
size=3
) +
labs(x="End of month", y="Cumulative Deaths (000)", title="Cumulative COVID Deaths (000) in US by source") +
scale_color_manual("Source", labels=c("Summed\nstates", "Summed\nsubtotals"), values=c("green", "blue"))
# Comparison of totals by state
plot_cum1021 <- death_sum_20211108 %>%
filter(abb %in% c(state.abb, "DC"), name=="covidDeaths", date == "2021-10-31") %>%
select(abb, cumValue) %>%
inner_join(select(filter(death_daily_211104, date == "2021-10-31"), abb, tot_deaths), by=c("abb")) %>%
mutate(pctdiff=abs(tot_deaths-cumValue)/(tot_deaths+cumValue))
plot_cum1021 %>%
arrange(-pctdiff)
## # A tibble: 51 x 4
## abb cumValue tot_deaths pctdiff
## <chr> <dbl> <dbl> <dbl>
## 1 NY 27709 56328 0.341
## 2 DC 1597 1190 0.146
## 3 MA 14479 18996 0.135
## 4 MO 15016 12055 0.109
## 5 TN 19346 16351 0.0839
## 6 NE 3444 2975 0.0731
## 7 ND 2030 1762 0.0707
## 8 GA 25523 29065 0.0649
## 9 OH 27872 24527 0.0638
## 10 KY 11053 9766 0.0618
## # ... with 41 more rows
plot_cum1021 %>%
summarize(across(where(is.numeric), sum))
## # A tibble: 1 x 3
## cumValue tot_deaths pctdiff
## <dbl> <dbl> <dbl>
## 1 717549 740363 2.11
plot_cum1021 %>%
ggplot(aes(x=fct_reorder(abb, cumValue))) +
geom_col(aes(y=cumValue/1000), fill="lightblue") +
geom_point(aes(y=tot_deaths/1000), size=3) +
coord_flip() +
labs(x=NULL,
y="Cumulative Deaths (000)",
title="Cumulative COVID Deaths (000) in US as of 2021-10-31",
subtitle="Filled bars are summed subtotals, points are from CDC daily"
)
Breakdowns of deaths by age and place are also explored:
deathAllData <- testList_20211108$deathLoc_conv %>%
filter(deathPlace=="Total - All Places of Death")
deathAllData
## # A tibble: 12,150 x 18
## asofDate startDate endDate Group Year Month HHSRegion State deathPlace
## <date> <date> <date> <chr> <int> <int> <chr> <chr> <chr>
## 1 2021-11-03 2020-01-01 2021-10-30 By T~ NA NA 0 Unit~ Total - A~
## 2 2021-11-03 2020-01-01 2021-10-30 By T~ NA NA 0 Unit~ Total - A~
## 3 2021-11-03 2020-01-01 2021-10-30 By T~ NA NA 0 Unit~ Total - A~
## 4 2021-11-03 2020-01-01 2021-10-30 By T~ NA NA 0 Unit~ Total - A~
## 5 2021-11-03 2020-01-01 2021-10-30 By T~ NA NA 0 Unit~ Total - A~
## 6 2021-11-03 2020-01-01 2021-10-30 By T~ NA NA 0 Unit~ Total - A~
## 7 2021-11-03 2020-01-01 2021-10-30 By T~ NA NA 0 Unit~ Total - A~
## 8 2021-11-03 2020-01-01 2021-10-30 By T~ NA NA 0 Unit~ Total - A~
## 9 2021-11-03 2020-01-01 2021-10-30 By T~ NA NA 0 Unit~ Total - A~
## 10 2021-11-03 2020-01-01 2021-10-30 By T~ NA NA 4 Alab~ Total - A~
## # ... with 12,140 more rows, and 9 more variables: Age <chr>,
## # covidDeaths <dbl>, totalDeaths <dbl>, pneumoDeaths <dbl>,
## # pneumoCovidDeaths <dbl>, fluDeaths <dbl>, pnemoFluCovidDeaths <dbl>,
## # Footnote <chr>, abb <chr>
# Proportions of death by age and cause
deathAllData %>%
filter(State=="United States", Age != "All Ages", Group=="By Total") %>%
select(Age, where(is.numeric), -Year, -Month) %>%
pivot_longer(-Age) %>%
ggplot() +
geom_col(aes(x=name, y=value, fill=fct_rev(Age)), position="fill") +
labs(x=NULL, y="Proportion of Deaths", title="Proportion of deaths by cause (2020-October 2021)") +
scale_fill_discrete("Age")
# Proportions of death by age and month
deathAllData %>%
filter(State=="United States", Age != "All Ages", Group=="By Month") %>%
mutate(ym=lubridate::ym(paste0(Year, "-", zeroPad2(Month)))) %>%
select(Age, ym, totalDeaths, covidDeaths, fluDeaths) %>%
pivot_longer(-c(Age, ym)) %>%
ggplot() +
geom_col(aes(x=ym, y=value, fill=fct_rev(Age)), position="fill") +
facet_wrap(~name) +
labs(x=NULL, y="Proportion of Deaths", title="Proportion of deaths by age and cause (2020-October 2021)") +
scale_fill_discrete("Age")
# Total death by age and month
deathAllData %>%
filter(State=="United States", Age != "All Ages", Group=="By Month") %>%
mutate(ym=lubridate::ym(paste0(Year, "-", zeroPad2(Month)))) %>%
select(Age, ym, totalDeaths, covidDeaths, fluDeaths) %>%
pivot_longer(-c(Age, ym)) %>%
filter(ym <= "2021-09-30") %>%
ggplot() +
geom_line(aes(x=ym, y=value, color=fct_rev(Age), group=Age)) +
facet_wrap(~name, scales="free_y") +
labs(x=NULL, y="Proportion of Deaths", title="Deaths by age and cause (2020-September 2021)") +
scale_color_discrete("Age")
deathPlaceData <- testList_20211108$deathLoc_conv %>%
filter(Age == "All Ages")
deathPlaceData
## # A tibble: 12,150 x 18
## asofDate startDate endDate Group Year Month HHSRegion State deathPlace
## <date> <date> <date> <chr> <int> <int> <chr> <chr> <chr>
## 1 2021-11-03 2020-01-01 2021-10-30 By T~ NA NA 0 Unit~ Total - A~
## 2 2021-11-03 2020-01-01 2021-10-30 By T~ NA NA 0 Unit~ Healthcar~
## 3 2021-11-03 2020-01-01 2021-10-30 By T~ NA NA 0 Unit~ Healthcar~
## 4 2021-11-03 2020-01-01 2021-10-30 By T~ NA NA 0 Unit~ Healthcar~
## 5 2021-11-03 2020-01-01 2021-10-30 By T~ NA NA 0 Unit~ Decedent'~
## 6 2021-11-03 2020-01-01 2021-10-30 By T~ NA NA 0 Unit~ Hospice f~
## 7 2021-11-03 2020-01-01 2021-10-30 By T~ NA NA 0 Unit~ Nursing h~
## 8 2021-11-03 2020-01-01 2021-10-30 By T~ NA NA 0 Unit~ Other
## 9 2021-11-03 2020-01-01 2021-10-30 By T~ NA NA 0 Unit~ Place of ~
## 10 2021-11-03 2020-01-01 2021-10-30 By T~ NA NA 4 Alab~ Total - A~
## # ... with 12,140 more rows, and 9 more variables: Age <chr>,
## # covidDeaths <dbl>, totalDeaths <dbl>, pneumoDeaths <dbl>,
## # pneumoCovidDeaths <dbl>, fluDeaths <dbl>, pnemoFluCovidDeaths <dbl>,
## # Footnote <chr>, abb <chr>
# Proportions of death by place and cause
deathPlaceData %>%
filter(State=="United States", deathPlace!="Total - All Places of Death", Group=="By Total") %>%
select(deathPlace, where(is.numeric), -Year, -Month) %>%
pivot_longer(-deathPlace) %>%
ggplot() +
coord_flip() +
geom_col(aes(x=name, y=value, fill=fct_rev(deathPlace)), position="fill") +
labs(x=NULL, y="Proportion of Deaths", title="Proportion of deaths by place (2020-October 2021)") +
scale_fill_discrete("Death\nPlace") +
theme(legend.position="bottom")
# Proportions of death by place and month
deathPlaceData %>%
filter(State=="United States", deathPlace!="Total - All Places of Death", Group=="By Month") %>%
mutate(ym=lubridate::ym(paste0(Year, "-", zeroPad2(Month)))) %>%
select(deathPlace, ym, totalDeaths, covidDeaths, fluDeaths) %>%
pivot_longer(-c(deathPlace, ym)) %>%
ggplot() +
geom_col(aes(x=ym, y=value, fill=fct_rev(deathPlace)), position="fill") +
facet_wrap(~name) +
labs(x=NULL, y="Proportion of Deaths", title="Proportion of deaths by place and cause (2020-October 2021)") +
scale_fill_discrete("Death\nPlace") +
theme(legend.position="bottom")
# Total death by place and month
deathPlaceData %>%
filter(State=="United States", deathPlace!="Total - All Places of Death", Group=="By Month") %>%
mutate(ym=lubridate::ym(paste0(Year, "-", zeroPad2(Month)))) %>%
select(deathPlace, ym, totalDeaths, covidDeaths, fluDeaths) %>%
pivot_longer(-c(deathPlace, ym)) %>%
filter(ym <= "2021-09-30") %>%
ggplot() +
geom_line(aes(x=ym, y=value, color=fct_rev(deathPlace), group=deathPlace)) +
facet_wrap(~name, scales="free_y") +
labs(x=NULL, y="Proportion of Deaths", title="Deaths by place and cause (2020-September 2021)") +
scale_color_discrete("Death\nPlace")
Exploration of the place of death for COVID and non-COVID deaths is explored:
# Locations of death by age
tempPlotData <- testList_20211108$deathLoc_conv %>%
mutate(nonCovidDeaths=zeroNA(totalDeaths)-zeroNA(covidDeaths)) %>%
select(Group, startDate, endDate, State, deathPlace, Age, where(is.numeric), -Month, -Year) %>%
pivot_longer(where(is.numeric))
# Basic plotting data
p1 <- tempPlotData %>%
filter(name %in% c("covidDeaths", "nonCovidDeaths"),
State=="United States",
Group=="By Total"
) %>%
ggplot(aes(x=Age, y=value/1000)) +
coord_flip() +
scale_fill_discrete("") +
theme(legend.position="bottom") +
labs(x=NULL, y="Deaths (000)", title="United States deaths (2020 thru October 2021)")
# Overall deaths by age and type
p1a <- p1 +
geom_col(data=~filter(., deathPlace=="Total - All Places of Death", Age !="All Ages"),
aes(fill=name),
position="stack"
)
# Proportion deaths by age and type
p1b <- p1 +
geom_col(data=~filter(., deathPlace=="Total - All Places of Death"),
aes(fill=fct_rev(name)),
position="fill"
) +
labs(y="Proportion of deaths")
gridExtra::grid.arrange(p1a, p1b, nrow=1)
# Overall deaths by age and type and location
p1 +
geom_col(data=~filter(., deathPlace!="Total - All Places of Death", Age != "All Ages"),
aes(fill=name),
position="stack"
) +
facet_wrap(~deathPlace)
# Proportion of deaths by age and type and location
p1 +
geom_col(data=~filter(., Age !="All Ages"),
aes(fill=fct_rev(name)),
position="fill"
) +
facet_wrap(~deathPlace) +
labs(y="Proportion of deaths") +
geom_hline(yintercept=0.25, lty=2)
The evolution by month is also explored:
# Basic plotting data
p2 <- tempPlotData %>%
filter(name %in% c("covidDeaths", "nonCovidDeaths"),
State=="United States",
Group=="By Month",
endDate <= "2021-10-31"
) %>%
ggplot(aes(x=fct_reorder(deathPlace, value, max), y=value/1000)) +
coord_flip() +
scale_fill_discrete("") +
theme(legend.position="bottom") +
labs(x=NULL, y="Deaths (000)", title="United States deaths (2020 thru October 2021)")
# Overall deaths by month and place
p2 +
geom_col(data=~filter(., deathPlace!="Total - All Places of Death", Age =="All Ages"),
aes(fill=name),
position="stack"
) +
facet_wrap(~endDate, nrow=3)
# Proportion of deaths by month and place
p2 +
geom_col(data=~filter(., deathPlace!="Total - All Places of Death", Age =="All Ages"),
aes(fill=name),
position="fill"
) +
facet_wrap(~endDate, nrow=3) +
labs(y="Proportion of deaths")
# Overall deaths by month and age
p2 +
geom_col(data=~filter(., deathPlace=="Total - All Places of Death", Age !="All Ages"),
aes(x=Age, fill=name),
position="stack"
) +
facet_wrap(~endDate, nrow=3)
# Proportion of deaths by month and age
p2 +
geom_col(data=~filter(., deathPlace=="Total - All Places of Death", Age !="All Ages"),
aes(x=Age, fill=name),
position="fill"
) +
facet_wrap(~endDate, nrow=3) +
labs(y="Proportion of deaths")
There are clear patterns in COVID deaths by each of age, month, and place, with proportions evolving over time.
The functional form is used for comparing deaths by state between CDC files:
compareCDCDeaths <- function(lstLoc,
lstState,
thruDate,
keyDeathPlaces=c("Total - All Places of Death"),
keyAges=c("All Ages"),
returnData=FALSE
) {
# FUNCTION ARGUMENTS
# lstLoc: list containing the processed death-location data, with sub-list "deathLoc_conv"
# lstState: list containing processed CDC COVID death data by state, with sub-list "dfPerCapita"
# thruDate: character, formatted as YYYY-MM-DD
# keyDeathPlaces: places of death to include from lstLoc
# keyAges: ages to include from lstLoc
# returnData: boolean, should data be returned?
# Create summary by state and year-month
death_sum <- lstLoc[["deathLoc_conv"]] %>%
filter(!is.na(Year),
!is.na(Month),
deathPlace %in% all_of(keyDeathPlaces),
Age %in% all_of(keyAges)
) %>%
mutate(ym=lubridate::ym(paste0(Year, "-", zeroPad2(Month)))) %>%
select(State, abb, ym, where(is.numeric), -Year, -Month) %>%
pivot_longer(-c(State, abb, ym)) %>%
arrange(State, abb, name, ym) %>%
group_by(State, abb, name) %>%
mutate(cumValue=cumsum(ifelse(is.na(value), 0, value))) %>%
ungroup() %>%
mutate(date=lubridate::ceiling_date(ym, unit="month")-lubridate::days(1))
# Create summary from state-level file
death_daily <- lstState[["dfPerCapita"]] %>%
select(date, abb=state, tot_deaths) %>%
mutate(Year=lubridate::year(date), Month=lubridate::month(date)) %>%
group_by(Year, Month) %>%
filter(date==max(date)) %>%
ungroup()
# Create a plot for evolution of United States
p1 <- death_sum %>%
filter(abb=="US", name=="covidDeaths", ym <= thruDate) %>%
ggplot(aes(x=date)) +
geom_line(aes(y=cumValue/1000, color="blue"), size=2) +
geom_point(data=summarize(group_by(filter(death_daily, date <= thruDate), date),
tot_deaths=sum(tot_deaths, na.rm=TRUE)
),
aes(y=tot_deaths/1000, color="green"),
size=3
) +
labs(x="End of month",
y="Cumulative COVID Deaths (000)",
title="Cumulative COVID Deaths (000) in US by source"
) +
scale_color_manual("Source",
labels=c("Summed\nCDC Location", "Summed\nCDC Daily"),
values=c("green", "blue")
)
print(p1)
# Comparison of totals by state
plot_cum <- death_sum %>%
filter(abb %in% c(state.abb, "DC"),
name=="covidDeaths",
date == thruDate
) %>%
select(abb, cumValue) %>%
inner_join(select(filter(death_daily, date == thruDate), abb, tot_deaths), by=c("abb")) %>%
mutate(pctdiff=abs(tot_deaths-cumValue)/(tot_deaths+cumValue))
plot_cum %>%
arrange(-pctdiff) %>%
print()
plot_cum %>%
summarize(across(where(is.numeric), sum)) %>%
print()
p2 <- plot_cum %>%
ggplot(aes(x=fct_reorder(abb, cumValue))) +
geom_col(aes(y=cumValue/1000), fill="lightblue") +
geom_point(aes(y=tot_deaths/1000), size=3) +
coord_flip() +
labs(x=NULL,
y="Cumulative Deaths (000)",
title=paste0("Cumulative COVID Deaths (000) in US as of ", thruDate),
subtitle="Filled bars are summed CDC location, points are from CDC daily"
)
print(p2)
if(isTRUE(returnData)) return(plot_cum)
}
compareCDCDeaths(lstLoc=testList_20211108, lstState=readFromRDS("cdc_daily_211104"), thruDate="2021-10-31")
## # A tibble: 51 x 4
## abb cumValue tot_deaths pctdiff
## <chr> <dbl> <dbl> <dbl>
## 1 NY 27709 56328 0.341
## 2 DC 1597 1190 0.146
## 3 MA 14479 18996 0.135
## 4 MO 15016 12055 0.109
## 5 TN 19346 16351 0.0839
## 6 NE 3444 2975 0.0731
## 7 ND 2030 1762 0.0707
## 8 GA 25523 29065 0.0649
## 9 OH 27872 24527 0.0638
## 10 KY 11053 9766 0.0618
## # ... with 41 more rows
## # A tibble: 1 x 3
## cumValue tot_deaths pctdiff
## <dbl> <dbl> <dbl>
## 1 717549 740363 2.11
A function for breakdowns of deaths by age and place are also explored:
plotDeathDetails <- function(lst,
keyVar,
timeLabel,
dfFilter=list(),
p1Include=list("State"="United States", "Group"="By Total"),
p1Exclude=list("Age"="All Ages"),
p2Include=list("State"="United States", "Group"="By Month"),
p2Exclude=p1Exclude,
p3Include=p2Include,
p3Exclude=p2Exclude,
p2PlotVars=c("totalDeaths", "covidDeaths", "fluDeaths"),
p3PlotVars=p2PlotVars,
legendLabel=keyVar,
legendPosition=NULL,
returnData=FALSE
) {
# FUNCTION ARGUMENTS:
# lst: a processed list file with sub-list "deathLoc_conv"
# keyVar: the key variable being explored (e.g., "Age" or "deathPlace")
# timeLabel: the label for the plot timing (e.g., "2020-October 2021")
# dfFilter: a list of format list("variable"=c("allowed values")) for filtering data to produce df
# p1Include: a list of format list("variable"=c("allowed values")) for filtering data to produce plot 1
# p1Exclude: a list of format list("variable"=c("disallowed values")) for filtering data to produce plot 1
# p2Include: a list of format list("variable"=c("allowed values")) for filtering data to produce plot 2
# p2Exclude: a list of format list("variable"=c("disallowed values")) for filtering data to produce plot 2
# p3Include: a list of format list("variable"=c("allowed values")) for filtering data to produce plot 3
# p3Exclude: a list of format list("variable"=c("disallowed values")) for filtering data to produce plot 3
# p2PlotVars: variables to include in the second plot
# p3PlotVars: variables to include in the third plot
# legendLabel: label to be used for the legend
# legendPosition: the position for the legend (NULL means leave defaults)
# returnData: boolean, should the data be returned?
# Create the data for processing
df <- lst[["deathLoc_conv"]] %>%
rowFilter(lstFilter=dfFilter)
# Create the first plot
p1 <- df %>%
rowFilter(lstFilter=p1Include, lstExclude=p1Exclude) %>%
select(all_of(keyVar), where(is.numeric), -Year, -Month) %>%
pivot_longer(!all_of(keyVar)) %>%
ggplot() +
geom_col(aes(x=name, y=value, fill=fct_rev(get(keyVar))), position="fill") +
labs(x=NULL,
y="Proportion of Deaths",
title=paste0("Proportion of deaths (", timeLabel, ")")
) +
scale_fill_discrete(legendLabel)
if(!is.null(legendPosition)) p1 <- p1 + theme(legend.position=legendPosition)
print(p1)
# Create the second plot
p2 <- df %>%
rowFilter(lstFilter=p2Include, lstExclude=p2Exclude) %>%
mutate(ym=lubridate::ym(paste0(Year, "-", zeroPad2(Month)))) %>%
select(all_of(keyVar), ym, all_of(p2PlotVars)) %>%
pivot_longer(!c(all_of(keyVar), ym)) %>%
ggplot() +
geom_col(aes(x=ym, y=value, fill=fct_rev(get(keyVar))), position="fill") +
facet_wrap(~name) +
labs(x=NULL,
y="Proportion of Deaths",
title=paste0("Proportion of deaths (", timeLabel, ")")
) +
scale_fill_discrete(legendLabel)
if(!is.null(legendPosition)) p2 <- p2 + theme(legend.position=legendPosition)
print(p2)
# Create the third plot
p3 <- df %>%
rowFilter(lstFilter=p2Include, lstExclude=p2Exclude) %>%
mutate(ym=lubridate::ym(paste0(Year, "-", zeroPad2(Month)))) %>%
select(all_of(keyVar), ym, all_of(p3PlotVars)) %>%
pivot_longer(!c(all_of(keyVar), ym)) %>%
ggplot() +
geom_line(aes(x=ym, y=value, color=fct_rev(get(keyVar)), group=get(keyVar))) +
facet_wrap(~name, scales="free_y") +
labs(x=NULL,
y="Deaths",
title=paste0("Deaths (", timeLabel, ")")
) +
scale_color_discrete(legendLabel)
if(!is.null(legendPosition)) p3 <- p3 + theme(legend.position=legendPosition)
print(p3)
# Return the data file if requested
if(isTRUE(returnData)) return(df)
}
# Plots by age cohort
plotDeathDetails(testList_20211108,
keyVar="Age",
timeLabel="2020-October 2021",
dfFilter=list("deathPlace"="Total - All Places of Death"),
returnData=TRUE
)
## # A tibble: 12,150 x 18
## asofDate startDate endDate Group Year Month HHSRegion State deathPlace
## <date> <date> <date> <chr> <int> <int> <chr> <chr> <chr>
## 1 2021-11-03 2020-01-01 2021-10-30 By T~ NA NA 0 Unit~ Total - A~
## 2 2021-11-03 2020-01-01 2021-10-30 By T~ NA NA 0 Unit~ Total - A~
## 3 2021-11-03 2020-01-01 2021-10-30 By T~ NA NA 0 Unit~ Total - A~
## 4 2021-11-03 2020-01-01 2021-10-30 By T~ NA NA 0 Unit~ Total - A~
## 5 2021-11-03 2020-01-01 2021-10-30 By T~ NA NA 0 Unit~ Total - A~
## 6 2021-11-03 2020-01-01 2021-10-30 By T~ NA NA 0 Unit~ Total - A~
## 7 2021-11-03 2020-01-01 2021-10-30 By T~ NA NA 0 Unit~ Total - A~
## 8 2021-11-03 2020-01-01 2021-10-30 By T~ NA NA 0 Unit~ Total - A~
## 9 2021-11-03 2020-01-01 2021-10-30 By T~ NA NA 0 Unit~ Total - A~
## 10 2021-11-03 2020-01-01 2021-10-30 By T~ NA NA 4 Alab~ Total - A~
## # ... with 12,140 more rows, and 9 more variables: Age <chr>,
## # covidDeaths <dbl>, totalDeaths <dbl>, pneumoDeaths <dbl>,
## # pneumoCovidDeaths <dbl>, fluDeaths <dbl>, pnemoFluCovidDeaths <dbl>,
## # Footnote <chr>, abb <chr>
# Plots by place of death cohort
plotDeathDetails(testList_20211108,
keyVar="deathPlace",
timeLabel="2020-October 2021",
dfFilter=list("Age"="All Ages"),
p1Exclude=list("deathPlace"="Total - All Places of Death"),
legendLabel="Death\nPlace",
legendPosition="bottom",
returnData=TRUE
)
## # A tibble: 12,150 x 18
## asofDate startDate endDate Group Year Month HHSRegion State deathPlace
## <date> <date> <date> <chr> <int> <int> <chr> <chr> <chr>
## 1 2021-11-03 2020-01-01 2021-10-30 By T~ NA NA 0 Unit~ Total - A~
## 2 2021-11-03 2020-01-01 2021-10-30 By T~ NA NA 0 Unit~ Healthcar~
## 3 2021-11-03 2020-01-01 2021-10-30 By T~ NA NA 0 Unit~ Healthcar~
## 4 2021-11-03 2020-01-01 2021-10-30 By T~ NA NA 0 Unit~ Healthcar~
## 5 2021-11-03 2020-01-01 2021-10-30 By T~ NA NA 0 Unit~ Decedent'~
## 6 2021-11-03 2020-01-01 2021-10-30 By T~ NA NA 0 Unit~ Hospice f~
## 7 2021-11-03 2020-01-01 2021-10-30 By T~ NA NA 0 Unit~ Nursing h~
## 8 2021-11-03 2020-01-01 2021-10-30 By T~ NA NA 0 Unit~ Other
## 9 2021-11-03 2020-01-01 2021-10-30 By T~ NA NA 0 Unit~ Place of ~
## 10 2021-11-03 2020-01-01 2021-10-30 By T~ NA NA 4 Alab~ Total - A~
## # ... with 12,140 more rows, and 9 more variables: Age <chr>,
## # covidDeaths <dbl>, totalDeaths <dbl>, pneumoDeaths <dbl>,
## # pneumoCovidDeaths <dbl>, fluDeaths <dbl>, pnemoFluCovidDeaths <dbl>,
## # Footnote <chr>, abb <chr>
A function for exploring the place of death for COVID and non-COVID deaths is created:
exploreDeathPlace <- function(lst,
timeLabel,
endMonth,
returnData=FALSE
) {
# FUNCTION ARGUMENTS:
# lst: a processed list file with sub-list "deathLoc_conv"
# timeLabel: the label for the plot timing (e.g., "2020-October 2021")
# endMonth: the last day of the final month to include in the monthly plots, formatted as "YYYY-MM-DD"
# returnData: boolean, should df be returned?
# Locations of death by age
df <- lst[["deathLoc_conv"]] %>%
mutate(nonCovidDeaths=zeroNA(totalDeaths)-zeroNA(covidDeaths)) %>%
select(Group, startDate, endDate, State, deathPlace, Age, where(is.numeric), -Month, -Year) %>%
pivot_longer(where(is.numeric))
# Basic plotting data
p1 <- df %>%
filter(name %in% c("covidDeaths", "nonCovidDeaths"),
State=="United States",
Group=="By Total"
) %>%
ggplot(aes(x=Age, y=value/1000)) +
coord_flip() +
scale_fill_discrete("") +
theme(legend.position="bottom") +
labs(x=NULL,
y="Deaths (000)",
title=paste0("United States deaths (", timeLabel, ")")
)
# Overall deaths by age and type
p1a <- p1 +
geom_col(data=~filter(., deathPlace=="Total - All Places of Death", Age !="All Ages"),
aes(fill=name),
position="stack"
)
# Proportion deaths by age and type
p1b <- p1 +
geom_col(data=~filter(., deathPlace=="Total - All Places of Death"),
aes(fill=fct_rev(name)),
position="fill"
) +
labs(y="Proportion of deaths")
gridExtra::grid.arrange(p1a, p1b, nrow=1)
# Overall deaths by age and type and location
p1c <- p1 +
geom_col(data=~filter(., deathPlace!="Total - All Places of Death", Age != "All Ages"),
aes(fill=name),
position="stack"
) +
facet_wrap(~deathPlace)
# Proportion of deaths by age and type and location
p1d <- p1 +
geom_col(data=~filter(., Age !="All Ages"),
aes(fill=fct_rev(name)),
position="fill"
) +
facet_wrap(~deathPlace) +
labs(y="Proportion of deaths") +
geom_hline(yintercept=0.25, lty=2)
gridExtra::grid.arrange(p1c, p1d, nrow=1)
# Basic plotting data by month (not pivoted)
p2 <- df %>%
filter(name %in% c("covidDeaths", "nonCovidDeaths"),
State=="United States",
Group=="By Month",
endDate <= endMonth
) %>%
ggplot(aes(x=fct_reorder(deathPlace, value, max), y=value/1000)) +
scale_color_discrete("") +
theme(legend.position="bottom") +
labs(x=NULL,
y="Deaths (000)",
title=paste0("United States deaths (", timeLabel, ")")
)
# Basic plotting data by month (percent deaths from coVID)
p3 <- df %>%
filter(name %in% c("covidDeaths", "totalDeaths"),
State=="United States",
Group=="By Month",
endDate <= endMonth
) %>%
pivot_wider(names_from="name", values_from="value") %>%
mutate(pctCov=covidDeaths/totalDeaths)
# Overall deaths by month and place
p2a <- p2 +
geom_line(data=~filter(., deathPlace!="Total - All Places of Death", Age =="All Ages"),
aes(x=endDate, group=name, color=name)
) +
facet_wrap(~deathPlace[deathPlace!="Total - All Places of Death"], ncol=1)
# Proportion of deaths by month and place
p3a <- p3 %>%
filter(deathPlace!="Total - All Places of Death", Age =="All Ages") %>%
ggplot(aes(x=endDate, y=pctCov)) +
labs(x=NULL,
y="Proportion of deaths from COVID",
title=paste0("United States deaths (", timeLabel, ")")
) +
geom_col(fill="lightblue") +
geom_text(aes(label=paste0(round(100*pctCov), "%")),
vjust=0,
size=3
) +
facet_wrap(~deathPlace, ncol=1) +
lims(y=c(0, 1))
gridExtra::grid.arrange(p2a, p3a, nrow=1)
# Overall deaths by month and age
p2b <- p2 +
geom_line(data=~filter(., deathPlace=="Total - All Places of Death", Age !="All Ages"),
aes(x=endDate, group=name, color=name)
) +
facet_wrap(~Age[Age!="All Ages"], ncol=1)
# Proportion of deaths by month and age
p3b <- p3 %>%
filter(deathPlace=="Total - All Places of Death", Age !="All Ages") %>%
ggplot(aes(x=endDate, y=pctCov)) +
labs(x=NULL,
y="Proportion of deaths from COVID",
title=paste0("United States deaths (", timeLabel, ")")
) +
geom_col(fill="lightblue") +
geom_text(aes(label=paste0(round(100*pctCov), "%")),
vjust=0,
size=3
) +
facet_wrap(~Age, ncol=1) +
lims(y=c(0, 1))
gridExtra::grid.arrange(p2b, p3b, nrow=1)
# Return the processed data frame, if requested
if(isTRUE(returnData)) return(df)
}
exploreDeathPlace(testList_20211108,
timeLabel="2020-October 2021",
endMonth="2021-10-31",
returnData=TRUE
)
## # A tibble: 765,450 x 8
## Group startDate endDate State deathPlace Age name value
## <chr> <date> <date> <chr> <chr> <chr> <chr> <dbl>
## 1 By To~ 2020-01-01 2021-10-30 United ~ Total - All Pl~ All A~ covidDea~ 7.48e5
## 2 By To~ 2020-01-01 2021-10-30 United ~ Total - All Pl~ All A~ totalDea~ 6.09e6
## 3 By To~ 2020-01-01 2021-10-30 United ~ Total - All Pl~ All A~ pneumoDe~ 6.71e5
## 4 By To~ 2020-01-01 2021-10-30 United ~ Total - All Pl~ All A~ pneumoCo~ 3.84e5
## 5 By To~ 2020-01-01 2021-10-30 United ~ Total - All Pl~ All A~ fluDeaths 9.41e3
## 6 By To~ 2020-01-01 2021-10-30 United ~ Total - All Pl~ All A~ pnemoFlu~ 1.04e6
## 7 By To~ 2020-01-01 2021-10-30 United ~ Total - All Pl~ All A~ nonCovid~ 5.34e6
## 8 By To~ 2020-01-01 2021-10-30 United ~ Total - All Pl~ 0-17 ~ covidDea~ 5.76e2
## 9 By To~ 2020-01-01 2021-10-30 United ~ Total - All Pl~ 0-17 ~ totalDea~ 6.08e4
## 10 By To~ 2020-01-01 2021-10-30 United ~ Total - All Pl~ 0-17 ~ pneumoDe~ 1.07e3
## # ... with 765,440 more rows
The functions can be integrated:
# Load and process all-cause deaths data
testList_20211108 <- processAllCauseLocation("COvID_deaths_age_place_20211108.csv")
## `summarise()` has grouped output by 'State'. You can override using the `.groups` argument.
##
## States without abbreviations
## # A tibble: 2 x 10
## # Groups: State [2]
## State abb Year Month covidDeaths totalDeaths pneumoDeaths pneumoCovidDeat~
## <chr> <chr> <int> <int> <dbl> <dbl> <dbl> <dbl>
## 1 New Y~ <NA> 0 0 30430 134170 18820 11433
## 2 Puert~ <NA> 0 0 3160 58488 8147 2295
## # ... with 2 more variables: fluDeaths <dbl>, pnemoFluCovidDeaths <dbl>
##
## Sub-lists are identical by: asofDate, startDate, endDate, Group, State, deathPlace, Age
## # A tibble: 1,240 x 12
## asofDate startDate endDate Group State deathPlace Age name dfSub
## <date> <date> <date> <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 2021-11-03 2020-10-01 2020-10-31 By Mo~ Unite~ Total - Al~ 30-3~ pnemo~ 205
## 2 2021-11-03 2020-02-01 2020-02-29 By Mo~ Unite~ Total - Al~ 30-3~ pnemo~ 71
## 3 2021-11-03 2020-11-01 2020-11-30 By Mo~ Unite~ Total - Al~ 30-3~ pneum~ 227
## 4 2021-11-03 2020-08-01 2020-08-31 By Mo~ Unite~ Other 0-17~ total~ 115
## 5 2021-11-03 2020-09-01 2020-09-30 By Mo~ Unite~ Decedent's~ 50-6~ pnemo~ 189
## 6 2021-11-03 2020-01-01 2020-01-31 By Mo~ Unite~ Total - Al~ 30-3~ pnemo~ 183
## 7 2021-11-03 2020-02-01 2020-02-29 By Mo~ Unite~ Healthcare~ 65-7~ fluDe~ 204
## 8 2021-11-03 2020-02-01 2020-02-29 By Mo~ Unite~ Total - Al~ 65-7~ fluDe~ 317
## 9 2021-11-03 2020-01-01 2020-01-31 By Mo~ Unite~ Total - Al~ 30-3~ pneum~ 73
## 10 2021-11-03 2020-01-01 2021-10-30 By To~ Unite~ Other 40-4~ pneum~ 260
## # ... with 1,230 more rows, and 3 more variables: dfTot <dbl>, delta <dbl>,
## # pct <dbl>
##
## Sub-lists are identical by: asofDate, startDate, endDate, Group, State, deathPlace, Age
## # A tibble: 0 x 12
## # ... with 12 variables: asofDate <date>, startDate <date>, endDate <date>,
## # Group <chr>, State <chr>, deathPlace <chr>, Age <chr>, name <chr>,
## # dfSub <dbl>, dfTot <dbl>, delta <dbl>, pct <dbl>
##
## Sub-lists are identical by: asofDate, startDate, endDate, Group, State, deathPlace, Age
## # A tibble: 0 x 12
## # ... with 12 variables: asofDate <date>, startDate <date>, endDate <date>,
## # Group <chr>, State <chr>, deathPlace <chr>, Age <chr>, name <chr>,
## # dfSub <dbl>, dfTot <dbl>, delta <dbl>, pct <dbl>
# Compare all-cause deaths data between files
deathDelta_20211031 <- compareCDCDeaths(lstLoc=testList_20211108,
lstState=readFromRDS("cdc_daily_211104"),
thruDate="2021-10-31",
returnData=TRUE
)
## # A tibble: 51 x 4
## abb cumValue tot_deaths pctdiff
## <chr> <dbl> <dbl> <dbl>
## 1 NY 27709 56328 0.341
## 2 DC 1597 1190 0.146
## 3 MA 14479 18996 0.135
## 4 MO 15016 12055 0.109
## 5 TN 19346 16351 0.0839
## 6 NE 3444 2975 0.0731
## 7 ND 2030 1762 0.0707
## 8 GA 25523 29065 0.0649
## 9 OH 27872 24527 0.0638
## 10 KY 11053 9766 0.0618
## # ... with 41 more rows
## # A tibble: 1 x 3
## cumValue tot_deaths pctdiff
## <dbl> <dbl> <dbl>
## 1 717549 740363 2.11
# Plots by age cohort
deathAge_20211031 <- plotDeathDetails(testList_20211108,
keyVar="Age",
timeLabel="2020-October 2021",
dfFilter=list("deathPlace"="Total - All Places of Death"),
returnData=TRUE
)
# Plots by place of death cohort
deathPlace_20211031 <- plotDeathDetails(testList_20211108,
keyVar="deathPlace",
timeLabel="2020-October 2021",
dfFilter=list("Age"="All Ages"),
p1Exclude=list("deathPlace"="Total - All Places of Death"),
legendLabel="Death\nPlace",
legendPosition="bottom",
returnData=TRUE
)
exploreDeathPlace(testList_20211108,
timeLabel="2020-October 2021",
endMonth="2021-10-31",
returnData=FALSE
)
The process is converted to functional form:
analyzeAllCause <- function(loc,
cdcDailyList,
compareThruDate,
plotTitleTime=paste0("2020 through ", compareThruDate),
endMonth=NULL,
dlData=!file.exists(loc)
) {
# FUNCTION ARGUMENTS:
# loc: the location where detailed death data is stored (or should be downloaded to)
# will have ./RInputFiles/Coronavirus pre-pended to it
# cdcDailyList: list file containing processed CDC Daily files
# compareThruDate: date for comparisons of deaths between files
# plotTitleTime: text to be included in plot title to describe timing
# endMonth: ending month to be used for COVID vs non-COVID plots (NULL means infer from compareThruDate)
# dlData: boolean, should the data be downloaded?
# STEP 1: find ending month if passed as NULL
if(is.null(endMonth)) {
endMonth <- lubridate::floor_date(as.Date(compareThruDate)+lubridate::days(1), unit="month")
endMonth <- endMonth - lubridate::days(1)
}
# STEP 2: Load and process all-cause deaths data
allCauseList <- processAllCauseLocation(loc)
# STEP 3: Compare all-cause deaths between file
deathDelta <- compareCDCDeaths(lstLoc=allCauseList,
lstState=cdcDailyList,
thruDate=compareThruDate,
returnData=TRUE
)
# STEP 4: Plots by age cohort
deathAge <- plotDeathDetails(allCauseList,
keyVar="Age",
timeLabel=plotTitleTime,
dfFilter=list("deathPlace"="Total - All Places of Death"),
returnData=TRUE
)
# STEP 5: Plots by place of death cohort
deathPlace <- plotDeathDetails(allCauseList,
keyVar="deathPlace",
timeLabel=plotTitleTime,
dfFilter=list("Age"="All Ages"),
p1Exclude=list("deathPlace"="Total - All Places of Death"),
legendLabel="Death\nPlace",
legendPosition="bottom",
returnData=TRUE
)
# STEP 6: Additional plots for COVID vs non-COVID
exploreDeathPlace(allCauseList,
timeLabel=plotTitleTime,
endMonth=endMonth,
returnData=FALSE
)
# STEP 7: Return list
list(allCauseList=allCauseList,
deathDelta=deathDelta,
deathAge=deathAge,
deathPlace=deathPlace,
compareThruDate=compareThruDate,
endMonth=endMonth
)
}
allCauseTest <- analyzeAllCause(loc="COvID_deaths_age_place_20211108.csv",
cdcDailyList=readFromRDS("cdc_daily_211104"),
compareThruDate="2021-10-31"
)
## `summarise()` has grouped output by 'State'. You can override using the `.groups` argument.
##
## States without abbreviations
## # A tibble: 2 x 10
## # Groups: State [2]
## State abb Year Month covidDeaths totalDeaths pneumoDeaths pneumoCovidDeat~
## <chr> <chr> <int> <int> <dbl> <dbl> <dbl> <dbl>
## 1 New Y~ <NA> 0 0 30430 134170 18820 11433
## 2 Puert~ <NA> 0 0 3160 58488 8147 2295
## # ... with 2 more variables: fluDeaths <dbl>, pnemoFluCovidDeaths <dbl>
##
## Sub-lists are identical by: asofDate, startDate, endDate, Group, State, deathPlace, Age
## # A tibble: 1,240 x 12
## asofDate startDate endDate Group State deathPlace Age name dfSub
## <date> <date> <date> <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 2021-11-03 2020-10-01 2020-10-31 By Mo~ Unite~ Total - Al~ 30-3~ pnemo~ 205
## 2 2021-11-03 2020-02-01 2020-02-29 By Mo~ Unite~ Total - Al~ 30-3~ pnemo~ 71
## 3 2021-11-03 2020-11-01 2020-11-30 By Mo~ Unite~ Total - Al~ 30-3~ pneum~ 227
## 4 2021-11-03 2020-08-01 2020-08-31 By Mo~ Unite~ Other 0-17~ total~ 115
## 5 2021-11-03 2020-09-01 2020-09-30 By Mo~ Unite~ Decedent's~ 50-6~ pnemo~ 189
## 6 2021-11-03 2020-01-01 2020-01-31 By Mo~ Unite~ Total - Al~ 30-3~ pnemo~ 183
## 7 2021-11-03 2020-02-01 2020-02-29 By Mo~ Unite~ Healthcare~ 65-7~ fluDe~ 204
## 8 2021-11-03 2020-02-01 2020-02-29 By Mo~ Unite~ Total - Al~ 65-7~ fluDe~ 317
## 9 2021-11-03 2020-01-01 2020-01-31 By Mo~ Unite~ Total - Al~ 30-3~ pneum~ 73
## 10 2021-11-03 2020-01-01 2021-10-30 By To~ Unite~ Other 40-4~ pneum~ 260
## # ... with 1,230 more rows, and 3 more variables: dfTot <dbl>, delta <dbl>,
## # pct <dbl>
##
## Sub-lists are identical by: asofDate, startDate, endDate, Group, State, deathPlace, Age
## # A tibble: 0 x 12
## # ... with 12 variables: asofDate <date>, startDate <date>, endDate <date>,
## # Group <chr>, State <chr>, deathPlace <chr>, Age <chr>, name <chr>,
## # dfSub <dbl>, dfTot <dbl>, delta <dbl>, pct <dbl>
##
## Sub-lists are identical by: asofDate, startDate, endDate, Group, State, deathPlace, Age
## # A tibble: 0 x 12
## # ... with 12 variables: asofDate <date>, startDate <date>, endDate <date>,
## # Group <chr>, State <chr>, deathPlace <chr>, Age <chr>, name <chr>,
## # dfSub <dbl>, dfTot <dbl>, delta <dbl>, pct <dbl>
## # A tibble: 51 x 4
## abb cumValue tot_deaths pctdiff
## <chr> <dbl> <dbl> <dbl>
## 1 NY 27709 56328 0.341
## 2 DC 1597 1190 0.146
## 3 MA 14479 18996 0.135
## 4 MO 15016 12055 0.109
## 5 TN 19346 16351 0.0839
## 6 NE 3444 2975 0.0731
## 7 ND 2030 1762 0.0707
## 8 GA 25523 29065 0.0649
## 9 OH 27872 24527 0.0638
## 10 KY 11053 9766 0.0618
## # ... with 41 more rows
## # A tibble: 1 x 3
## cumValue tot_deaths pctdiff
## <dbl> <dbl> <dbl>
## 1 717549 740363 2.11
Deaths by year are compared between data sources:
df1 <- cdcList_20211105$cdc %>%
mutate(Year=as.integer(as.character(year))) %>%
group_by(Year) %>%
summarize(deaths=sum(deaths))
df2 <- allCauseTest$allCauseList$deathLoc_conv %>%
filter(Group=="By Year", State=="United States", Age=="All Ages", deathPlace=="Total - All Places of Death") %>%
select(Year, totalDeaths)
df1 %>%
left_join(df2, by="Year") %>%
ggplot(aes(x=Year)) +
geom_col(aes(y=totalDeaths/1000000), fill="lightblue") +
geom_line(aes(y=deaths/1000000), color="red") +
geom_text(aes(y=deaths/1000000 + 0.05, label=round(deaths/1000000, 2)), color="red", vjust=0) +
geom_text(aes(y=totalDeaths/2000000, label=round(totalDeaths/1000000, 2)), color="black", vjust=0) +
lims(y=c(0, NA)) +
labs(x=NULL,
y="Total Deaths (millions)",
title="Total US deaths by year and data source (millions)",
subtitle="2021 total is thru October 2021\nLine is CDC all-cause, bars are summed CDC death-place"
)
## Warning: Removed 5 rows containing missing values (position_stack).
## Warning: Removed 5 rows containing missing values (geom_text).
The CDC all-cause deaths are converted to monthly, assuming that days by week can be divided by 7 to create deaths by day:
df3 <- cdcList_20211105$cdc %>%
group_by(weekEnding) %>%
summarize(deathsPerDay=sum(deaths)/7)
df4 <- allCauseTest$allCauseList$deathLoc_conv %>%
filter(Group=="By Month", State=="United States", Age=="All Ages", deathPlace=="Total - All Places of Death") %>%
select(Year, Month, totalDeaths) %>%
mutate(date=as.Date(paste0(Year, "-", Month, "-1")))
map_dfr(.x=0:6, .f=function(x) mutate(df3, day=weekEnding-lubridate::days(x))) %>%
mutate(ym=customYYYYMM(day)) %>%
group_by(ym) %>%
summarize(deaths=sum(deathsPerDay)) %>%
mutate(date=as.Date(paste0(ym, "-01"))) %>%
ggplot(aes(x=date)) +
geom_line(aes(y=deaths/1000)) +
geom_line(data=df4, aes(y=totalDeaths/1000), color="red", linetype="dashed") +
lims(y=c(0, NA)) +
labs(x=NULL,
y="US Monthly Deaths (000)",
title="Comparison of US Monthly Deaths (000) by CDC data source",
subtitle="Solid black line is all-cause (2015-2021)\nDashed red line is summed age-place (2020-2021)"
)
The data appear to be largely internally consistent at the US national level. Trends are assessed using only the non-COVID data:
df4 <- allCauseTest$allCauseList$deathLoc_conv %>%
filter(Group=="By Month", State=="United States", Age=="All Ages", deathPlace=="Total - All Places of Death") %>%
select(Year, Month, totalDeaths, covidDeaths) %>%
mutate(date=as.Date(paste0(Year, "-", Month, "-1")), nonCovidDeaths=totalDeaths-covidDeaths)
map_dfr(.x=0:6, .f=function(x) mutate(df3, day=weekEnding-lubridate::days(x))) %>%
mutate(ym=customYYYYMM(day)) %>%
group_by(ym) %>%
summarize(deaths=sum(deathsPerDay)) %>%
mutate(date=as.Date(paste0(ym, "-01"))) %>%
filter(date <= "2021-09-30") %>%
ggplot(aes(x=date)) +
geom_line(aes(y=deaths/1000)) +
geom_line(data=filter(df4, date <= "2021-09-30"),
aes(y=nonCovidDeaths/1000),
color="red",
linetype="dashed"
) +
lims(y=c(0, NA)) +
labs(x=NULL,
y="US Monthly Deaths (000)",
title="Comparison of US Monthly Deaths (000) and non-COVID Deaths (000)",
subtitle="Solid black line is all-cause (2015-2021)\nDashed red line is summed age-place non-COVID (2020-2021)"
)
A more integrated plot is produced:
dfTest <- map_dfr(.x=0:6, .f=function(x) mutate(df3, day=weekEnding-lubridate::days(x))) %>%
mutate(ym=customYYYYMM(day)) %>%
group_by(ym) %>%
summarize(deaths=sum(deathsPerDay)) %>%
mutate(date=as.Date(paste0(ym, "-01"))) %>%
filter(date <= "2021-09-30") %>%
mutate(year=lubridate::year(date), month=factor(month.abb[lubridate::month(date)], levels=month.abb))
lmTest <- lm(deaths ~ year + month, data=filter(dfTest, date <= "2019-12-31"))
mapColor <- c("Actual\nall-cause"="red", "Actual\nnon-COVID"="black", "Trend from\n2015-2019"="red")
mapLineType <- c("Actual\nall-cause"="solid", "Actual\nnon-COVID"="solid", "Trend from\n2015-2019"="dotted")
mapSize <- c("Actual\nall-cause"=1.5, "Actual\nnon-COVID"=0.5, "Trend from\n2015-2019"=1)
dfTest %>%
mutate(pred=predict(lmTest, newdata=.)) %>%
ggplot(aes(x=date)) +
geom_line(aes(y=deaths/1000,
color="Actual\nall-cause",
linetype="Actual\nall-cause",
size="Actual\nall-cause"
)
) +
geom_line(data=filter(df4, date <= "2021-09-30"),
aes(y=nonCovidDeaths/1000,
color="Actual\nnon-COVID",
linetype="Actual\nnon-COVID",
size="Actual\nnon-COVID"
)
) +
geom_line(aes(y=pred/1000,
color="Trend from\n2015-2019",
linetype="Trend from\n2015-2019",
size="Trend from\n2015-2019"
)
) +
lims(y=c(0, NA)) +
labs(x=NULL, y="Monthly deaths (000)", title="US monthly deaths (000) by type and vs. 2015-2019 trend") +
scale_color_manual(name="Source:", values=mapColor) +
scale_linetype_manual(name="Source:", values=mapLineType) +
scale_size_manual(name="Source:", values=mapSize)
The process is converted to functional form:
deathTrends <- function(lstCDC,
lstCDCSub,
dateThru,
trendThru,
lstCDCFilter=list(),
lstCDCSubFilter=list("State"="United States", "Age"="All Ages"),
createPlot=TRUE,
mapColor=c("Actual\nall-cause"="red",
"Actual\nnon-COVID"="black",
"Trend from\n2015-2019"="red"
),
mapLineType=c("Actual\nall-cause"="solid",
"Actual\nnon-COVID"="solid",
"Trend from\n2015-2019"="dotted"
),
mapSize=c("Actual\nall-cause"=1.5,
"Actual\nnon-COVID"=0.5,
"Trend from\n2015-2019"=1
)
) {
# FUNCTION ARGUMENTS:
# lstCDC: processed list of CDC all-cause deaths
# lstCDCSub: processed list of CDC age-place-cause deaths
# dateThru: latest date for the analysis
# trendThru: latest date for the trend regression
# lstCDCFilter: named list with filtering criteria for lstCDC, passed to rowFilter()
# lstCDCSubFilter: named list with filtering criteria for lstCDCSub, passed to rowFilter()
# createPlot: boolean, should the plot be created (if FALSE, only the data is created ad returned)
# mapColor: mapping vector for plot color elements
# mapLineType: mapping vector for plot linetype elements
# mapSize: mapping vector for plot size elements
# Extrapolate all-cause deaths by month
dfAllCause <- lstCDC[["cdc"]] %>%
rowFilter(lstFilter=lstCDCFilter) %>%
group_by(weekEnding) %>%
summarize(deathsPerDay=sum(deaths)/7)
dfAllCause <- map_dfr(.x=0:6, .f=function(x) mutate(dfAllCause, day=weekEnding-lubridate::days(x))) %>%
mutate(ym=customYYYYMM(day)) %>%
group_by(ym) %>%
summarize(deaths=sum(deathsPerDay)) %>%
mutate(date=as.Date(paste0(ym, "-01"))) %>%
filter(date <= dateThru) %>%
mutate(year=lubridate::year(date),
month=factor(month.abb[lubridate::month(date)], levels=month.abb)
)
# Create the linear trend of all-cause deaths
lmAllCause <- lm(deaths ~ year + month, data=filter(dfAllCause, date <= trendThru))
# Create the integrated all-cause deaths with projections
dfAllCause <- dfAllCause %>%
mutate(pred=predict(lmAllCause, newdata=.))
# Extract the subset data
dfSubset <- lstCDCSub[["allCauseList"]][["deathLoc_conv"]] %>%
rowFilter(lstFilter=list("Group"="By Month", "deathPlace"="Total - All Places of Death")) %>%
rowFilter(lstFilter=lstCDCSubFilter) %>%
group_by(Year, Month) %>%
summarize(across(c(totalDeaths, covidDeaths), .fns=sum, na.rm=TRUE), .groups="drop") %>%
mutate(date=as.Date(paste0(Year, "-", Month, "-1")),
nonCovidDeaths=totalDeaths-covidDeaths
)
if(isTRUE(createPlot)) {
# Create and print the plot
p1 <- dfAllCause %>%
ggplot(aes(x=date)) +
geom_line(aes(y=deaths/1000,
color="Actual\nall-cause",
linetype="Actual\nall-cause",
size="Actual\nall-cause"
)
) +
geom_line(data=filter(dfSubset, date <= dateThru),
aes(y=nonCovidDeaths/1000,
color="Actual\nnon-COVID",
linetype="Actual\nnon-COVID",
size="Actual\nnon-COVID"
)
) +
geom_line(aes(y=pred/1000,
color="Trend from\n2015-2019",
linetype="Trend from\n2015-2019",
size="Trend from\n2015-2019"
)
) +
lims(y=c(0, NA)) +
labs(x=NULL,
y="Monthly deaths (000)",
title="US monthly deaths (000) by type and vs. 2015-2019 trend"
) +
scale_color_manual(name="Source:", values=mapColor) +
scale_linetype_manual(name="Source:", values=mapLineType) +
scale_size_manual(name="Source:", values=mapSize)
print(p1)
}
# Return key data components
list(dfAllCause=dfAllCause, dfSubset=dfSubset, lmAllCause=lmAllCause)
}
deathTrends(cdcList_20211105,
lstCDCSub = allCauseTest,
dateThru = "2021-09-30",
trendThru = "2019-12-31"
)
## $dfAllCause
## # A tibble: 81 x 6
## ym deaths date year month pred
## <chr> <dbl> <date> <dbl> <fct> <dbl>
## 1 2015-01 238563 2015-01-01 2015 Jan 250506.
## 2 2015-02 225723 2015-02-01 2015 Feb 223918.
## 3 2015-03 241354. 2015-03-01 2015 Mar 240273.
## 4 2015-04 223917. 2015-04-01 2015 Apr 222645.
## 5 2015-05 222797. 2015-05-01 2015 May 220754.
## 6 2015-06 210827. 2015-06-01 2015 Jun 209920.
## 7 2015-07 216130. 2015-07-01 2015 Jul 214852.
## 8 2015-08 213679 2015-08-01 2015 Aug 213638.
## 9 2015-09 209264. 2015-09-01 2015 Sep 209284.
## 10 2015-10 222420 2015-10-01 2015 Oct 223164.
## # ... with 71 more rows
##
## $dfSubset
## # A tibble: 22 x 6
## Year Month totalDeaths covidDeaths date nonCovidDeaths
## <int> <int> <dbl> <dbl> <date> <dbl>
## 1 2020 1 264568 5 2020-01-01 264563
## 2 2020 2 244827 20 2020-02-01 244807
## 3 2020 3 269692 7161 2020-03-01 262531
## 4 2020 4 322276 65477 2020-04-01 256799
## 5 2020 5 280412 38297 2020-05-01 242115
## 6 2020 6 250303 18006 2020-06-01 232297
## 7 2020 7 278855 31115 2020-07-01 247740
## 8 2020 8 277118 29878 2020-08-01 247240
## 9 2020 9 257040 19139 2020-09-01 237901
## 10 2020 10 273703 24909 2020-10-01 248794
## # ... with 12 more rows
##
## $lmAllCause
##
## Call:
## lm(formula = deaths ~ year + month, data = filter(dfAllCause,
## date <= trendThru))
##
## Coefficients:
## (Intercept) year monthFeb monthMar monthApr monthMay
## -7359002 3776 -26589 -10234 -27861 -29753
## monthJun monthJul monthAug monthSep monthOct monthNov
## -40587 -35654 -36868 -41223 -27342 -28396
## monthDec
## -8911
deathTrends(cdcList_20211105,
lstCDCSub = allCauseTest,
lstCDCFilter=list("state"=c("NY")),
lstCDCSubFilter=list("Age"="All Ages", "State"=c("New York", "New York City")),
dateThru = "2021-09-30",
trendThru = "2019-12-31"
)
## $dfAllCause
## # A tibble: 81 x 6
## ym deaths date year month pred
## <chr> <dbl> <date> <dbl> <fct> <dbl>
## 1 2015-01 13674 2015-01-01 2015 Jan 14171.
## 2 2015-02 13324 2015-02-01 2015 Feb 12627.
## 3 2015-03 13944 2015-03-01 2015 Mar 13457.
## 4 2015-04 12837. 2015-04-01 2015 Apr 12522.
## 5 2015-05 12486. 2015-05-01 2015 May 12295.
## 6 2015-06 11644. 2015-06-01 2015 Jun 11708.
## 7 2015-07 12106. 2015-07-01 2015 Jul 12059.
## 8 2015-08 11969. 2015-08-01 2015 Aug 11956.
## 9 2015-09 11636. 2015-09-01 2015 Sep 11670.
## 10 2015-10 12882. 2015-10-01 2015 Oct 12773.
## # ... with 71 more rows
##
## $dfSubset
## # A tibble: 22 x 6
## Year Month totalDeaths covidDeaths date nonCovidDeaths
## <int> <int> <dbl> <dbl> <date> <dbl>
## 1 2020 1 14385 0 2020-01-01 14385
## 2 2020 2 13198 0 2020-02-01 13198
## 3 2020 3 18828 2883 2020-03-01 15945
## 4 2020 4 41465 22189 2020-04-01 19276
## 5 2020 5 17823 5562 2020-05-01 12261
## 6 2020 6 12751 1119 2020-06-01 11632
## 7 2020 7 12676 437 2020-07-01 12239
## 8 2020 8 12463 253 2020-08-01 12210
## 9 2020 9 12474 219 2020-09-01 12255
## 10 2020 10 13320 437 2020-10-01 12883
## # ... with 12 more rows
##
## $lmAllCause
##
## Call:
## lm(formula = deaths ~ year + month, data = filter(dfAllCause,
## date <= trendThru))
##
## Coefficients:
## (Intercept) year monthFeb monthMar monthApr monthMay
## -181965.25 97.34 -1544.31 -714.20 -1648.91 -1875.69
## monthJun monthJul monthAug monthSep monthOct monthNov
## -2463.09 -2112.26 -2214.57 -2501.17 -1397.94 -1658.34
## monthDec
## -715.89
deathTrends(cdcList_20211105,
lstCDCSub = allCauseTest,
lstCDCFilter=list("age"=c("75-84 years", "85 years and older")),
lstCDCSubFilter=list("Age"=c("75-84 years", "85 years and over"), "State"="United States"),
dateThru = "2021-09-30",
trendThru = "2019-12-31"
)
## $dfAllCause
## # A tibble: 81 x 6
## ym deaths date year month pred
## <chr> <dbl> <date> <dbl> <fct> <dbl>
## 1 2015-01 139755 2015-01-01 2015 Jan 142613.
## 2 2015-02 129294 2015-02-01 2015 Feb 125991.
## 3 2015-03 135823. 2015-03-01 2015 Mar 133894.
## 4 2015-04 124560. 2015-04-01 2015 Apr 122884.
## 5 2015-05 121968. 2015-05-01 2015 May 120095.
## 6 2015-06 113895. 2015-06-01 2015 Jun 112770.
## 7 2015-07 116777. 2015-07-01 2015 Jul 114917.
## 8 2015-08 114749. 2015-08-01 2015 Aug 114393.
## 9 2015-09 112808. 2015-09-01 2015 Sep 112947.
## 10 2015-10 121273. 2015-10-01 2015 Oct 121953.
## # ... with 71 more rows
##
## $dfSubset
## # A tibble: 22 x 6
## Year Month totalDeaths covidDeaths date nonCovidDeaths
## <int> <int> <dbl> <dbl> <date> <dbl>
## 1 2020 1 145443 2 2020-01-01 145441
## 2 2020 2 134508 8 2020-02-01 134500
## 3 2020 3 148010 3607 2020-03-01 144403
## 4 2020 4 180499 38888 2020-04-01 141611
## 5 2020 5 151014 23524 2020-05-01 127490
## 6 2020 6 130249 10012 2020-06-01 120237
## 7 2020 7 144350 16181 2020-07-01 128169
## 8 2020 8 145219 16172 2020-08-01 129047
## 9 2020 9 136030 10775 2020-09-01 125255
## 10 2020 10 147333 15205 2020-10-01 132128
## # ... with 12 more rows
##
## $lmAllCause
##
## Call:
## lm(formula = deaths ~ year + month, data = filter(dfAllCause,
## date <= trendThru))
##
## Coefficients:
## (Intercept) year monthFeb monthMar monthApr monthMay
## -3784430 1949 -16622 -8719 -19729 -22518
## monthJun monthJul monthAug monthSep monthOct monthNov
## -29843 -27696 -28220 -29666 -20660 -20377
## monthDec
## -8432
The data outputs are used to calculate excess deaths as both 1) non-COVID vs baseline, and 2) COVID:
# Get the data
deathAllList <- deathTrends(cdcList_20211105,
lstCDCSub = allCauseTest,
dateThru = "2021-09-30",
trendThru = "2019-12-31",
createPlot=FALSE
)
# Create the integrated file
deathAllList$dfAllCause %>%
select(date, allCause=deaths, pred) %>%
inner_join(select(deathAllList$dfSubset, date, nonCovidDeaths, covidDeaths), by="date") %>%
mutate(deltaAllCause=allCause-covidDeaths-nonCovidDeaths,
deltaPred=nonCovidDeaths-pred
) %>%
pivot_longer(-c(date)) %>%
arrange(date) %>%
group_by(name) %>%
mutate(cumValue=cumsum(value)) %>%
ggplot(aes(x=date, y=cumValue/1000)) +
geom_line(data=~filter(., name=="deltaPred"), aes(color="nonCovid")) +
geom_text(data=~filter(., name=="deltaPred", date==max(date)),
aes(color="nonCovid", label=round(cumValue/1000)),
hjust=0
) +
geom_line(data=~filter(., name=="covidDeaths"), aes(color="covid")) +
geom_text(data=~filter(., name=="covidDeaths", date==max(date)),
aes(color="covid", label=round(cumValue/1000)),
hjust=0
) +
labs(x=NULL, y="Cumulative Deaths (000)", title="Cumulative excess deaths (000) since January 2020") +
scale_color_manual("Type:",
values=c("nonCovid"="black", "covid"="red"),
labels=c("nonCovid"="non-COVID\nvs. trend of\n2015-2019", "covid"="COVID")
) +
geom_hline(yintercept=0, lty=2)
This is converted to functional form:
makeCumulativeDeath <- function(lst,
divBy=1000,
yLab=NULL,
plotTitle=NULL,
plotSubtitle=NULL,
plotLabels=c("nonCovid"="non-COVID\nvs. trend of\n2015-2019", "covid"="COVID"),
returnData=FALSE
) {
# FUNCTION ARGUMENTS
# lst: a processed list file from deathTrends()
# divBy: divide death values by (1000 means plots will be in thousands)
# yLab: y-axis label (NULL means use divBy to create)
# plotTitle: title (NULL means use divBy and earliest date in inner_join to create)
# plotSubtitle: subtitle (NULL means none)
# plotLabels: mapping file of elements to labels for the plot legend
# returnData: boolean, should the data frame be returned?
# Create the data frame
df <- lst[["dfAllCause"]] %>%
select(date, allCause=deaths, pred) %>%
inner_join(select(lst[["dfSubset"]], date, nonCovidDeaths, covidDeaths), by="date") %>%
mutate(deltaAllCause=allCause-covidDeaths-nonCovidDeaths,
deltaPred=nonCovidDeaths-pred
) %>%
pivot_longer(-c(date)) %>%
arrange(date) %>%
group_by(name) %>%
mutate(cumValue=cumsum(value))
# Create the labels if needed
# Unit conversion text
if(divBy==1) xtraText <- ""
else if(isTRUE(all.equal(log10(divBy), round(log10(divBy)))))
xtraText <- paste0(" (", stringr::str_sub(as.character(divBy), start=2), ")")
else xtraText <- paste0(" (", as.character(round(divBy)), "s)")
# Earliest date
earlyDate <- format(min(df$date), "%B %Y")
# Replacement of NULL values
if (is.null(yLab)) yLab <- paste0("Cumulative Deaths", xtraText)
if (is.null(plotTitle)) plotTitle <- paste0("Cumulative change in deaths", xtraText, " since ", earlyDate)
# Create the plot
p1 <- df %>%
ggplot(aes(x=date, y=cumValue/divBy)) +
geom_line(data=~filter(., name=="deltaPred"), aes(color="nonCovid")) +
geom_text(data=~filter(., name=="deltaPred", date==max(date)),
aes(color="nonCovid", label=round(cumValue/divBy)),
hjust=0
) +
geom_line(data=~filter(., name=="covidDeaths"), aes(color="covid")) +
geom_text(data=~filter(., name=="covidDeaths", date==max(date)),
aes(color="covid", label=round(cumValue/divBy)),
hjust=0
) +
labs(x=NULL, y=yLab, title=plotTitle, subtitle=plotSubtitle) +
scale_color_manual("Type:", values=c("nonCovid"="black", "covid"="red"), labels=plotLabels) +
geom_hline(yintercept=0, lty=2)
print(p1)
if(isTRUE(returnData)) return(df)
}
makeCumulativeDeath(deathAllList)
makeCumulativeDeath(deathAllList, returnData=TRUE)
## # A tibble: 126 x 4
## # Groups: name [6]
## date name value cumValue
## <date> <chr> <dbl> <dbl>
## 1 2020-01-01 allCause 263344. 263344.
## 2 2020-01-01 pred 269389. 269389.
## 3 2020-01-01 nonCovidDeaths 264563 264563
## 4 2020-01-01 covidDeaths 5 5
## 5 2020-01-01 deltaAllCause -1224. -1224.
## 6 2020-01-01 deltaPred -4826. -4826.
## 7 2020-02-01 allCause 244263. 507606.
## 8 2020-02-01 pred 242800. 512188.
## 9 2020-02-01 nonCovidDeaths 244807 509370
## 10 2020-02-01 covidDeaths 20 25
## # ... with 116 more rows
deathTrends(cdcList_20211105,
lstCDCSub = allCauseTest,
lstCDCFilter=list("age"=c("75-84 years", "85 years and older")),
lstCDCSubFilter=list("Age"=c("75-84 years", "85 years and over"), "State"="United States"),
dateThru = "2021-09-30",
trendThru = "2019-12-31",
createPlot=FALSE
) %>%
makeCumulativeDeath(plotSubtitle="Age 75+")